If this question has been asked and answered before, my apologies. I couldn't find anything from looking.
How can I use linux grep / regex to find unknown characters in an email address? For example, let's say we had this list:
userone:123456#example.com
usertwo:123#example.com
userthree:12#example.com
how could I grep the list to find emails matching ***#example.com?
(the only email that should be found from this is 123#example.com)
I'm aware that grep -e '...#example\.com' would work, but a period matches any character in grep, so doing this would also find :12#example.com. Plus, MOST email addresses don't contain just any character; they are typically confined to letters, numbers, periods, and underscores (many email providers don't allow anything else).
I need to use something else besides a period symbol in grep, something like [a-Z0-9._] so that letters, numbers, periods, and underscores are included but nothing else. I'm unsure of how to go about this. Thanks
EDIT: What I've tried so far:
grep -e '[a-zA-Z0-9_.]{3}#example.com' *. This doesn't work, so it comes down to just me getting the regex wrong.
If the email addresses are always preceded by a username, which is then followed by a colon or a space and then the email address, you can use that knowledge to restrict your matches.
What does a username look like? You need to know if you're going to use it to find matches. Let's say for now it is letters, numbers, dash, and underscore, it always starts with a letter, and is from 2 to 12 characters long. We also know it's got a colon or space after it. The regex for that is
[A-Za-z][A-Za-z0-9_-]{1,11}[: ]
That would be followed by your email address which, it sounds like, is something you decide on and input because that's what you're looking for at the moment.
Your example of test*****#example.com would be
[A-Za-z][A-Za-z0-9_-]{1,11}[: ]test.\+#example.com
or, if exactly 5 chars after "test"
[A-Za-z][A-Za-z0-9_-]{1,11}[: ]test.....#example.com
Your original sample ***#example.com is "any 3-char address at example.com" and would be
[A-Za-z][A-Za-z0-9_-]{1,11}[: ]...#example.com
This would be a pain to retype that prefix all the time, so you'd want to wrap that in a script that uses prefix + what_i_typed as the pattern.
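A minimal sketch of such a wrapper, assuming a grep that supports -E (extended regex, so {1,11} and {3} need no backslashes). The names find_email and sample_list are made up for illustration, and the character class [A-Za-z0-9._] is the one the question asked for rather than a full definition of a valid address.

```shell
# Hypothetical helper: glue the username prefix onto the address pattern
# passed as $1, then search the remaining arguments (or stdin if none).
find_email() {
    address_pattern=$1
    shift
    grep -E "[A-Za-z][A-Za-z0-9_-]{1,11}[: ]${address_pattern}" "$@"
}

# Sample list from the question.
sample_list() {
    printf '%s\n' \
        'userone:123456#example.com' \
        'usertwo:123#example.com' \
        'userthree:12#example.com'
}

# Exactly three letters, digits, dots or underscores before #example.com.
sample_list | find_email '[A-Za-z0-9._]{3}#example\.com'
```

Because the prefix ends in [: ], the three-character class has to start right after the colon, so 123456#example.com and 12#example.com are both rejected.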
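Try this command line, which I use to find anything in any files. The pattern has to be quoted, otherwise the shell treats everything from the # onwards as a comment:
grep -r -i '#example\.com' ./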
I have some scripts which have many email addresses in different domains (say domain1.com, domain2.com). I want to replace all of them with some generic email address in a common domain, say domain.com, keeping the rest of the script the same.
I am using the expression below in sed but it doesn't seem to work. (It is returning the same output as the input, so it looks like the search is not matching. However, when I tested the regex \S+#\S+/ in an online tester, it seemed to match email addresses.)
s/\S+#\S+/genericid#domain.com/g
For example, I have 2 scripts
$ cat script1.sh
abcd.efg#domain.com
export SENDFROM="xyz#domain1.com" blah_4
$ cat script2.sh
echo foo|mailx -s "blah" pqr#domain2.com,def#domain.com,some#domain.com
omg#domain.com
foo abc#domain.com bar
My result after sed -i should be
$ cat script1.sh
genericid#domain.com
export SENDFROM="genericid#domain.com" blah_4
$ cat script2.sh
echo foo|mailx -s "blah" genericid#domain.com,genericid#domain.com,genericid#domain.com
genericid#domain.com
foo genericid#domain.com bar
I am using Linux 3.10.0-327.28.2.el7.x86_64
Any suggestion please?
Update:
I managed to make it work with 's/\S\+#\S\+.com/genericid#domain.com/g'. There were 2 problems with the previous search:
The + needed a \ before it.
As the file had other # lines (for database connections), I had to append .com at the end, as all my addresses ended in .com.
Capturing email addresses using regex can be more difficult than it seems.
Anyhow, for replacing the domain, I think you could simplistically consider that an email domain starts when it finds:
1 alphanum char + # + N alphanum chars + . + N alphanum chars
Based on this assumption, in JavaScript I would use:
(\w#)(\w*\.\w*)
Replacing with:
$1newdomain.com
Hope it helps you.
UPDATE - Other answers, and comments on this one, point out that you may have to take extra steps to enable shorthand character class matching; I'm used to doing regex in perl, where this just works, so didn't think to address that possibility. This answer only addresses how to improve the matching once you have the regex functioning.
--
While the problem of matching email addresses with regex can be very complex (and in fact in the most general case isn't possible with true regex), you probably can handle your specific case.
The problem with what you have is that \S matches any non-whitespace, so address#something.com,address#somethingelse.com, where two addresses have no whitespace between, matches incorrectly.
So there's a couple ways to go about it, based on knowing what sorts of addresses you realistically will see. One solution would be to replace both instances of \S with [^\s,] (note the lowercase s), which simply excludes , from the match as well as whitespace.
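For sed specifically, here is a rough sketch of that idea. It assumes a sed that accepts -E (GNU and BSD sed both do), spells the class as [^[:space:],"] because \s is not portable inside a bracket expression, also excludes the double quote so that SENDFROM="..." keeps its quotes, and keeps the asker's trailing .com restriction. The function name anonymise is made up for the example.

```shell
# Hypothetical helper: replace every address-like token on stdin.
# An address here is: one or more characters that are not whitespace,
# a comma or a double quote, then #, then more of the same ending in ".com".
anonymise() {
    sed -E 's/[^[:space:],"]+#[^[:space:],"]+\.com/genericid#domain.com/g'
}

# Two lines taken from the sample scripts in the question.
printf '%s\n' \
    'export SENDFROM="xyz#domain1.com" blah_4' \
    'echo foo|mailx -s "blah" pqr#domain2.com,def#domain.com,some#domain.com' |
    anonymise
```

Running the same expression with sed -i -E on a copy of the scripts first is probably the safer way to try it out.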
Try this (quoted, so the shell leaves the brackets and asterisks alone):
sed 's/[^,#]*#[^,]*/genericid#domain.com/g'
and
echo 'pqr#domain2.com,def#domain.com,some#domain.com' | sed 's/[^,#]*#[^,]*/genericid#domain.com/g'
result
genericid#domain.com,genericid#domain.com,genericid#domain.com
Still UNIX-related, though requiring the more modern and far from ubiquitous tool, Ammonite, you could use email-replace.
$ amm path/to/email-replace.sc <random integer seed> <file1 with emails> <file2 with emails> ...
DISCLAIMER: the matcher is likely far from perfect, so use at your own risk, and always have backups available.
Note that by default it replaces emails with a new random e-mail address. To use a fixed email address, just replace the call to randEmail with a constant string.
I'm frustrated trying to find out how to use regex to do anything useful. I'm completely uncertain on everything that I do, and I've resorted to trial and error; which has not been effective.
I'm trying to list files in the current directory that start with a letter, contain a number, end with a dot followed by a lowercase character, etc.
So I know starts with a letter would be:
^[a-zA-Z]
but I don't know how to follow that up with CONTAINS a number. I know ends with a dot can be [\.]*, but I'm not sure. I'm seeing that $ is also used to match strings at the end of the word.
I have no idea if I should be using find with regex to do this, or ls | grep .... I'm completely lost. Any direction would be appreciated.
I guess the specific question I was trying to ask was how do I glue the expressions together. For example, I tried ls | grep ^[a-zA-Z][0-9] but this only shows files that start with a letter, followed by a number. I don't know how to write a regex that starts with a letter, and then add the next requirement, i.e. contains a number.
Starts with a letter: ^[a-zA-Z]
Contains a number: .*[0-9].*
Ends with a dot and lowercase letter: \.[a-z]$
Together:
^[a-zA-Z].*[0-9].*\.[a-z]$
The best way to find files that match a regex is with find -regex. When you use that the ^ and $ anchors are implied, so you can omit them. You'll need to tack on .*/ at the front to match any directory, since -regex matches both the directory and file name.
find -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]'
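Here is a small sketch for trying this out in a scratch directory. It is a variant of the command above, not the same thing: it uses [^/]* instead of .* so the three conditions apply to the file name alone and cannot spill into directory names, and it relies on -regex and -maxdepth, which GNU and BSD find provide but POSIX does not. The helper name list_matching is made up.

```shell
# Hypothetical helper: list regular files directly inside directory $1
# whose names start with a letter, contain a digit, and end in ".<lowercase>".
list_matching() {
    (cd "$1" && find . -maxdepth 1 -type f \
        -regex '\./[a-zA-Z][^/]*[0-9][^/]*\.[a-z]' | sort)
}

# Scratch directory with a few test names.
demo_dir=$(mktemp -d)
touch "$demo_dir/a1.c" "$demo_dir/report2.txt" "$demo_dir/1abc.d" "$demo_dir/notes.c"
list_matching "$demo_dir"    # only ./a1.c satisfies all three conditions
rm -r "$demo_dir"
```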
There's plenty of documentation online, eg. GNU's Reference Manual.
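Your particular example would require something like the following (note that a POSIX class such as [:alpha:] only works inside a bracket expression, hence the double brackets):
^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$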
or if POSIX classes are not available:
^[a-zA-Z].*[0-9].*\.[a-z]$
You can read either as:
start of line
a letter (upper or lower case)
any character, zero or more times
a digit
any character, zero or more times
a dot (must be escaped with a backslash)
a lower case letter
end of line
Once you settle on a regular expression, you can use it with ls within the directory you wish to find the files in:
cd <dir>
ls -1 | grep '^[a-zA-Z].*[0-9].*\.[a-z]$'
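To check the pattern without touching real files, the names can be fed in with printf. This is just a sketch: filter_names is a made-up helper, and it uses the POSIX classes written inside bracket expressions ([[:alpha:]] and so on), which is the form grep expects.

```shell
# Hypothetical helper: keep only names that start with a letter,
# contain a digit somewhere, and end with a dot plus one lowercase letter.
filter_names() {
    grep '^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$'
}

# report2.txt fails (extension longer than one letter), 1abc.d fails
# (starts with a digit), notes.c fails (no digit).
printf '%s\n' a1.c report2.txt 1abc.d notes.c Data9.x | filter_names
```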
NOTE: I tried to improve my answer based on some of the comments.
I have two groups of strings that take the formats
http://example.com/foo/something
and
http://example.com/foo/something/something-else/bar/1
Where example.com, foo and bar are fixed, something and something else could be any string and 1 is any number.
I want to use regex to match strings following the first format (they must start with http://example.com/foo/) and not the second. The exclusion could be around number of slashes, the "bar" string or ending in a number.
I don't have support for look ahead or look back.
What's the best approach?
Examples of strings that should match
http://example.com/foo/apple
http://example.com/foo/bear-bear
http://example.com/foo/cake-cake
Examples of strings that should NOT match
http://example.com/baa/apple
http://example.com/foo/apple/cake/bar/1
http://example.com/foo/bear-apple/camel/bar/2
Examples of strings that wouldn't exist in the data set
(So it doesn't matter if they match or not)
http://example.com/foo/bear-bear/cake/bar/two
http://example.com/foo/bear/camel/tar/2
http://example.com/foo/bear-bear/camel
http://example.com/foo/bear/camel/
http://example.com/foo/bear-bear/camel/tar/2
UPDATE
It turns out that the application I'm using this in takes its regex engine from Elasticsearch, so this documentation (and one of our developers) was helpful: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
The end solution was:
(http://example.com/foo.*)&~(.*bar.*)
All your examples have a specific prefix URL, followed by one-and-only-one path element. If this is the general case, you can do this by simply looking for the prefix URL followed by a word which doesn't contain a path separator, followed by EOL.
You didn't say what engine you're using, so here's an example with GNU grep in bash:
grep -e '^http://example.com/foo/[^/]\+$'
Bash makes for readable examples, because single-quoting means very few characters need escaping. The sole exception in my example is the + character.
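A quick way to test this against the URLs from the question, sketched with grep -E so the + needs no backslash at all. The dots in example.com are escaped here, which the command above does not do, and the helper names sample_urls and single_segment are made up.

```shell
# The "should match" and "should NOT match" examples from the question.
sample_urls() {
    printf '%s\n' \
        'http://example.com/foo/apple' \
        'http://example.com/foo/bear-bear' \
        'http://example.com/foo/cake-cake' \
        'http://example.com/baa/apple' \
        'http://example.com/foo/apple/cake/bar/1' \
        'http://example.com/foo/bear-apple/camel/bar/2'
}

# Fixed prefix, then exactly one path element (no further slash), then end of line.
single_segment() {
    grep -E '^http://example\.com/foo/[^/]+$'
}

sample_urls | single_segment
```

No lookahead is needed, because forbidding the slash after the prefix already rules out every URL of the second format.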
I would like to get all the results with grep or egrep from a file on my computer.
I just discovered that a string of the form
'+33. ... ... ..' can be found with the following regex:
'\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the preceding expression. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
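Both variants can be tried on a small fake "binary" before running them on the real file. This is only a sketch: the sample bytes and phone numbers are invented, the helper names are made up, and the separator class is the [ ._-] described above (space, dot, underscore, hyphen).

```shell
# Fake dump: two lines of text with NUL bytes mixed in, so grep would
# normally report "Binary file ... matches" instead of printing anything.
sample_dump() {
    printf 'junk\000+336 123 456 78\000more\nnoise +33612345678 end\n'
}

# Strict template: one digit, two groups of three, one group of two,
# each group preceded by exactly one separator.
strict_numbers() {
    grep -Eao '\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}'
}

# Loose variant: digit runs of any length, separators optional.
loose_numbers() {
    grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}'
}

sample_dump | strict_numbers   # finds only the spaced number
sample_dump | loose_numbers    # finds both numbers
```

-a makes grep treat the input as text and -o prints only the matched part, so the surrounding binary junk never reaches the output.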
It means that you've found a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt
This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciao
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, e.g.
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
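The loop above leaves the dots in the title. If the spaces are wanted as well, one possible extra step is to run the trimmed name through tr. This is a sketch under assumptions: title_from_name is a made-up helper, it uses [[:upper:]] instead of [A-Z] so the result does not depend on the locale's collation, and it ignores the file extension, which the renaming app in the question is said to preserve by itself.

```shell
# Hypothetical helper: turn the client's file name into "NNN Title With Spaces".
title_from_name() {
    name=$1
    rest=${name#*x}                                  # drop everything up to the first "x"
    rest=${rest%.[[:upper:]][[:upper:]][[:upper:]].*}   # drop ".DOC." and what follows
    printf '%s\n' "$rest" | tr '.' ' '               # second step: dots become spaces
}

title_from_name 'Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'
```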
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
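The effect of the leading .* is easy to see with sed. This is a sketch on an invented input string with two candidate places, assuming a sed with -E; [0-9] stands in for \d, which sed does not support.

```shell
# Invented input with two places where x<digits>. could match.
input='ax1.bx22.c'

# Greedy leading .* runs as far right as it can, so the capture is the LAST candidate.
with_padding=$(printf '%s\n' "$input" | sed -E 's/^.*x([0-9]+)\..*$/\1/')

# [^x]* cannot skip over an x, so the capture is the FIRST candidate.
without_padding=$(printf '%s\n' "$input" | sed -E 's/^[^x]*x([0-9]+)\..*$/\1/')

printf 'leading .*: %s\nleading [^x]*: %s\n' "$with_padding" "$without_padding"
```

The first command captures 22 and the second captures 1, even though the rest of the pattern is identical.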
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.