How do I convert regex to grep format

I'm trying to take a pattern that I can match in Notepad++ with a regular expression and use it with grep. I want to match email:32 characters (a-f0-9):3 characters (any character or symbol).
Here's an example:
Stack#overflow.com:999999999999999999999999999999a1:&U,
So far I've been able to match everything up to the second colon with this regex:
[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}:[a-f0-9]{32}:
But I'm unsure how to match the last 3 characters (which can be anything).
Furthermore, when trying to:
grep "[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}:[a-f0-9]{32}:" input.txt > output.txt
Nothing is written to my output file, which seems strange to me. I am using Cygwin Terminal on Windows to perform these greps.

Use grep -E or egrep if that is available in your environment.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
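The question also asks how to match the final three characters. A minimal sketch, assuming those three characters end the line: in an extended regex, .{3} matches any three characters and $ anchors them to the end of the line. The sample lines and variable names below are made up for illustration.

```shell
# Hypothetical sample data: one line in the expected format, one that is not.
sample='user#example.com:0123456789abcdef0123456789abcdef:&U,'
other='user#example.com:0123:abc'

# With -E, + and {n,m} work without backslashes.
# .{3}$ requires exactly three characters of any kind before the end of the line.
pattern='[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}:[a-f0-9]{32}:.{3}$'

matched=$(printf '%s\n' "$sample" "$other" | grep -E "$pattern")
printf '%s\n' "$matched"
```

If input.txt has Windows line endings, the carriage return at the end of each line counts as a character too, so the $ anchor may need to be dropped or the file converted first.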

How can I translate a regex within vim to work with sed?

I have a string that exists within a text file that I am trying to modify with regex.
"configuration_file_for_wks_33-40"
and I want to modify it so that it looks like this
"configuration_file_for_wks_33-40_6ks"
Within vim I can accomplish this with the following regex command
%s/33-\(\d\d\)/33-\1_6ks/
But if I try to pass that regex command to sed such as
sed 's/33-\(\d\d\)/33-\1_6ks/' input_file.json
The string is not changed, even if I include the -e parameter.
I have also tried to do this using ex as
echo '%s/33-\(\d\d\)/33-\1_6ks/' | ex input_file.json
If I use
sed 's/wks_33-\(\d\d\)*/wks_33-\1_6ks/' input_file.json
then I get
configuration_file_for_wks_33-_6ks40
For that, I've tried various different escaping patterns without any luck.
Can someone help me understand why these changes are not working?
vim has a different syntax for regular expressions (which is even configurable). Unfortunately, sed doesn't understand \d (see https://unix.stackexchange.com/a/414230/304256). With -E, you can match digits with [0-9] or [[:digit:]]:
$ sed -E 's/33-[0-9][0-9]/&_6ks/' input_file.json
configuration_file_for_wks_33-40_6ks
Note that you can use & in the replacement to insert the entire matched string.
So why does this happen?
$ sed 's/wks_33-\(\d\d\)*/wks_33-\1_6ks/' input_file.json
configuration_file_for_wks_33-_6ks40
Here, (\d\d)* is simply matched 0 times, so you replace wks_33- by wks_33-_6ks (\1 is a zero-length string) and 40 remains where it was before.
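The zero-match behavior can be reproduced with any starred group that cannot match. A small sketch, using a made-up group \(zz\) as a stand-in for the unrecognized \d\d:

```shell
line='configuration_file_for_wks_33-40'

# "zz" never occurs in the input, so the starred group matches zero times
# and \1 expands to an empty string.
result=$(printf '%s\n' "$line" | sed 's/wks_33-\(zz\)*/wks_33-\1_6ks/')
printf '%s\n' "$result"

# Without the *, the group has to match, so \1 holds the two digits.
fixed=$(printf '%s\n' "$line" | sed 's/wks_33-\([0-9][0-9]\)/wks_33-\1_6ks/')
printf '%s\n' "$fixed"
```

The first command leaves 40 after the inserted suffix, as in the question; the second puts the suffix after the digits.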
Translation from one language to another is best done with some reference material on hand:
sed BRE syntax
sed ERE syntax
sed classes
sed RE extensions
Even a superficial reading of these shows that sed doesn't support \d.
Possible alternatives to \d\d:
[[:digit:]]\{2\}
[0-9]\{2\}
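As a rough illustration, either alternative can be dropped into the original substitution. The sketch below assumes the string from the question (without its surrounding quotes) and shows the basic-regex spelling next to the equivalent extended-regex spelling with -E; the variable names are illustrative only.

```shell
line='configuration_file_for_wks_33-40'

# Basic regex (sed default): group and interval are written with backslashes.
bre=$(printf '%s\n' "$line" | sed 's/33-\([[:digit:]]\{2\}\)/33-\1_6ks/')

# Extended regex (-E): the same pattern without the backslashes.
ere=$(printf '%s\n' "$line" | sed -E 's/33-([0-9]{2})/33-\1_6ks/')

printf '%s\n' "$bre" "$ere"
```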
How can I translate a regex within vim to work with sed?
Since you write "a regex", I think you refer to any regex.
Translating a Vim regex to a Sed regex is not always possible, because a Vim regex can have lookarounds, whereas a Sed regex has no such things.

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results look wildly different from what I expect and from what I get in Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent, so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from fewest to most features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
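The difference between the first three modes can be seen with a tiny test. This is only a sketch on made-up input, using a + because that operator is what the regex in the question relies on:

```shell
# Two hypothetical input lines: one with repeated a's, one with a literal plus sign.
input=$(printf '%s\n' 'aaa' 'a+b')

# -F: no regex at all, so "a+" is searched for as plain text.
fixed=$(printf '%s\n' "$input" | grep -F 'a+')

# -G (the default): + is an ordinary character, so only the literal "a+" matches.
basic=$(printf '%s\n' "$input" | grep -G 'a+')

# -E: + means "one or more", so both lines contain a match.
extended=$(printf '%s\n' "$input" | grep -E 'a+')

printf 'fixed: %s\nbasic: %s\nextended:\n%s\n' "$fixed" "$basic" "$extended"
```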
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need them, but PCREs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

How to get sed to take extended regular expressions?

I want to do string replacement using regular expressions in sed. Now, I'm aware that the behavior of sed is funky on a Mac. I've often seen workarounds using egrep when I want to just examine a certain pattern in a line. But, in this case I want to do string replacement.
I want to replace cp an and cp <tab or newline> an with gggg. I tried the following, which would work under extended regular expressions:
sed -i'_backup' 's/cp\s+an/gggg/g'
But of course this does nothing. I tried egrepping, and of course it picks out the lines with cp <one or more space characters> an.
How do I get sed to do replacement using extended regular expressions? Or what is a better way to do replacement using regular expressions?
I'm on Mac OS X.
On OS X, the following command enables extended regex support:
sed -i.backup -E 's/cp[[:blank:]]+an/gggg/g'
POSIX Character Class Reference
Since you mentioned you want <newline> to be handled, you'll need to coax sed a bit. Your exact requirements aren't too clear to me but the following example illustrates that sed can easily handle certain cases in which a newline is in the "target" regex:
$ echo $'cp\nancp an' | sed -E '/cp/{N; s/cp(\n|[[:blank:]])an/gggg/g;}'
gggggggg
(Note to non-Mac readers: If your sed does not support -E, try -r instead.)
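The original attempt failed for two reasons: without -E, + is an ordinary character in a sed pattern, and \s is not a POSIX escape, so it may not be recognized. A minimal sketch of the -E form on made-up input, covering a space, a tab, and several spaces:

```shell
# Three hypothetical lines: single space, tab, and multiple spaces between "cp" and "an".
text=$(printf 'cp an\ncp\tan\ncp   an\n')

# [[:blank:]] matches a space or a tab; with -E, + means "one or more".
replaced=$(printf '%s\n' "$text" | sed -E 's/cp[[:blank:]]+an/gggg/g')
printf '%s\n' "$replaced"
```

To edit a file in place as in the question, the same expression can be combined with -i.backup as shown in the answer above.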

Trying to remove version number from a string using sed in OSX

I have what I hope is a simple issue which is stumping me. I need to take an installer file with a name like:
installer_v0.29_linux.run
installer_v10.22_linux_x64.run
installer_v1.1_osx.app
installer_v5.6_windows.exe
and zip it up into a file with the format
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
I already have a bash script running on OSX which does almost everything else I need in the build chain, and was certain I could achieve this with sed using something like:
ZIP_NAME=`echo "$OUTPUT_NAME" | sed -E 's/_(?:\d*\.)?\d+//g'`
That is, replacing the regex _(?:\d*\.)?\d+ with a blank - the regex should match any decimal number preceded by an underscore.
However, I get the error RE error: repetition-operator operand invalid when I try to run this. At this stage I am stumped - I have Googled around this and can't see what I am doing wrong. The regex I wrote works correctly at Regexr, but clearly some element of it is not supported by the sed implementation in OSX. Does anyone know what I am doing wrong?
You can try this sed:
sed 's/_v[^_]*//; s/\.[[:alnum:]]\+$/.zip/' file
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
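The error in the question comes from (?: and \d, which are Perl-style syntax; in an extended regex, the ? directly after ( has nothing to repeat. Note also that \+ in the command above is a GNU extension to basic regular expressions and may not be recognized by the sed shipped with OS X. A sketch of an equivalent using -E, with the file names from the question held in an illustrative variable:

```shell
names='installer_v0.29_linux.run
installer_v10.22_linux_x64.run
installer_v1.1_osx.app
installer_v5.6_windows.exe'

# First expression: drop "_v" plus a dotted version number.
# Second expression: replace the final extension with ".zip".
zips=$(printf '%s\n' "$names" | sed -E 's/_v[0-9]+(\.[0-9]+)*//; s/\.[[:alnum:]]+$/.zip/')
printf '%s\n' "$zips"
```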
You don't need sed, just some parameter expansion magic with an extended pattern.
shopt -s extglob
zip_name=${OUTPUT_NAME/_v+([^_])/}
The pattern _v+([^_]) matches a string starting with _v and all characters up to the next _. The extglob option enables the use of the +(...) pattern to match one or more occurrences of the enclosed pattern (in this case, a non-_ character). The parameter expansion ${var/pattern/} removes the first occurrence of the given pattern from the expansion of $var.
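The expansion above removes the version but leaves the original extension. A possible way to finish the job, assuming bash and a sample value for OUTPUT_NAME, is a second expansion that strips the last extension before appending .zip (the variable stripped is introduced here only for illustration):

```shell
#!/usr/bin/env bash
# extglob is a bash feature, so this sketch assumes bash rather than plain sh.
shopt -s extglob

OUTPUT_NAME='installer_v10.22_linux_x64.run'   # sample value from the question

# Remove "_v" followed by one or more non-underscore characters.
stripped=${OUTPUT_NAME/_v+([^_])/}

# ${var%.*} removes the shortest trailing ".something"; then append ".zip".
zip_name=${stripped%.*}.zip
printf '%s\n' "$zip_name"
```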
Try this way also
sed 's/_[^_]\+//' FileName
OutPut:
installer_linux.run
installer_linux_x64.run
installer_osx.app
installer_windows.exe
If you also want to append .zip to the name, use the following:
sed 's/\([^_]\+\).*\(_.*\).*/\1\2.zip/' Filename
Output :
installer_linux.run.zip
installer_x64.run.zip
installer_osx.app.zip
installer_windows.exe.zip

How to use regex OR in grep in Cygwin?

I need to return results for two different matches from a single file.
grep "string1" my.file
correctly returns the single instance of string1 in my.file
grep "string2" my.file
correctly returns the single instance of string2 in my.file
but
grep "string1|string2" my.file
returns nothing
In regex test apps that syntax is correct, so why does it not work for grep in Cygwin?
Using the | character without escaping it in a basic regular expression will only match the | literal. For instance, if you have a file with contents
string1
string2
string1|string2
Using grep "string1|string2" my.file will only match the last line
$ grep "string1|string2" my.file
string1|string2
In order to use the alternation operator |, you could:
Use a basic regular expression (just grep) and escape the | character in the regular expression
grep "string1\|string2" my.file
Use an extended regular expression with egrep or grep -E, as Julian already pointed out in his answer
grep -E "string1|string2" my.file
If it is two different patterns that you want to match, you could also specify them separately in -e options:
grep -e "string1" -e "string2" my.file
You might find the following sections of the grep reference useful:
Basic vs Extended Regular Expressions
Matching Control, where it explains -e
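The options above can be compared side by side. A short sketch, using the three-line example file from this answer plus one made-up non-matching line, fed through a pipe instead of my.file:

```shell
input=$(printf '%s\n' 'string1' 'string2' 'string1|string2' 'other')

# Basic regex: | is literal, so only the line containing "string1|string2" matches.
literal=$(printf '%s\n' "$input" | grep 'string1|string2')

# Extended regex: | is alternation, so every line with either string matches.
ere=$(printf '%s\n' "$input" | grep -E 'string1|string2')

# Separate -e patterns give the same result without any alternation syntax.
multi=$(printf '%s\n' "$input" | grep -e 'string1' -e 'string2')

printf 'literal:\n%s\nextended:\n%s\n' "$literal" "$ere"
```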
You may need to either use egrep or grep -E. The pipe OR symbol is part of 'extended' grep and may not be supported by the basic Cygwin grep.
Alternatively, with plain grep, you probably need to escape the pipe symbol (\|).
The best and most clear way I've found is:
grep -e REG1 -e REG2 -e REG3 _FILETOGREP_
I never use pipe as it's less evident and very awkward to get working.
You can find this information by reading the fine manual: grep(1), which you can find by running 'man grep'. It describes the difference between grep and egrep, and between basic and extended regular expressions, along with a lot of other useful information about grep.