import java.util.regex._

object RegMatcher extends App {
  val str = "facebook.com"
  val urlpattern = "(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?"
  var regex_list: Set[(String, String)] = Set()
  val url = Pattern.compile(urlpattern)
  var m = url.matcher(str)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("match: " + m.group(0))
  }
  val str2 = "url is ftp://filezilla.com"
  m = url.matcher(str2)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("str 2 match: " + m.group(0))
  }
}
This returns
match: facebook.com
str 2 match: url is ftp:
How do I change the regex pattern so that both strings are matched correctly?
What do the symbols actually mean in regex? I am very new to regex. Please help.
I read your regex as:
0 or 1 (? modifier) of the schemes (http://, https://, etc.),
followed by 0 or 1 instance of www plus any one character (the . after www is unescaped too),
followed by 1 or more (+ modifier) alphanumeric characters,
followed by any character (. is a regex special character, remember, standing for any one character),
followed by 0 or more (* modifier) alphanumerics,
followed by any character (. again),
followed by 3 lowercase letters ({3} being an exact count modifier),
followed by 0 or 1 of any character (.?),
followed by an optional group (the trailing ?) of one or more lowercase letters.
If you plug your regex into regex101.com, you'll not only see a similar breakdown (without any errors I might have made, though I think I nailed it), you'll also have a chance to test various strings against it. Then, once you have your regexes working the way you want, you can bring them back to your script. It's a solid workflow for both learning regexes and developing an expression for a particular purpose.
If you drop your regex and your inputs into regex101, you'll see why you're getting the output you see. But here's a hint: when you ask your regular expression to match "url is ftp://filezilla.com", nothing excludes "url is" from being part of the match, because each unescaped . happily matches a space. That's why you're not matching the scheme you want. Regex101 really is a great way to investigate this further.
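To make the hint concrete, here is a small sketch in Python (the thread itself uses Scala with java.util.regex, but this pattern behaves the same way in both engines). LOOSE is the pattern from the question, unchanged. STRICT is only a hypothetical tightened variant with escaped dots, an assumption for illustration and not something proposed in this thread; first_match is a helper name made up for the example.

```python
import re

# The pattern from the question, unchanged: every "." is unescaped,
# so each one matches any single character, including a space.
LOOSE = re.compile(
    r"(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+)"
    r".[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?"
)

# Hypothetical tightened pattern (an assumption, not from the thread):
# literal dots are escaped and the schemes are grouped.
STRICT = re.compile(
    r"(?:(?:https?|file|ftp)://)?(?:www\.)?"
    r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-z]{2,}"
)


def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


for s in ("facebook.com", "url is ftp://filezilla.com"):
    print(repr(s), "| loose:", first_match(LOOSE, s), "| strict:", first_match(STRICT, s))
```

With the loose pattern, "url" satisfies the first alphanumeric run and the following dots consume the spaces, which reproduces the "url is ftp:" result from the question.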
The regex can be updated to
((ftp|https|http?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,})
This is all I needed.
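One thing worth checking before adopting the updated pattern: it requires either a scheme or a leading www., so a bare host such as facebook.com from the original question no longer matches. The sketch below (Python, with the pattern copied as-is; find_url is a helper name made up for the example) shows this under those assumptions.

```python
import re

# The updated pattern from the answer above, copied without changes.
UPDATED = re.compile(
    r"((ftp|https|http?):\/\/(?:www\.|(?!www))"
    r"[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}"
    r"|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,})"
)


def find_url(text):
    """Return the first URL-like match in text, or None if there is none."""
    m = UPDATED.search(text)
    return m.group(0) if m else None


print(find_url("url is ftp://filezilla.com"))
print(find_url("www.facebook.com"))
print(find_url("facebook.com"))
```

If bare hosts also need to match, the scheme and www. parts would have to become optional again.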
I have string formulas like this:
?{a,b,c,d}
It can be embedded like this:
?{a,b,c,?{x,y,z}}
or, equivalently:
?{a,b,c,
?{x,y,z}
}
So I have to find the commas that are inside the second and deeper "levels" of brackets.
In the example below I marked the "levels" where I have to find all commas:
?{a,b,c,
?{x,y, <--Those
?{1,2,3} <--Those
}
}
I've tried with lookahead and lookbehind, but I'm totally confused now :/
Here is my latest working try, but it is not good at all:
OnlineRegex
Update:
To avoid misunderstanding, I don't want to count the commas.
I'd like to get groups of commas to replace them.
The condition is to find the commas that have more than one "open tag" before them, like this: ?{
.. without a closing tag like this: }
Example:
In this case I must not replace any commas:
?{1,2,3} ?{a,b,c}
But in this case I have to replace the commas between a, b and c:
?{1,2,3,?{a,b,c}}
For the examples you have provided, the following regex works (it gives the desired output you described):
(?<!^\?{[^{}]*),(?=[\s\S]*(?:\s*}){2,})
For String ?{a,b,c,d}, see Demo1 No Match
For String, ?{a,b,c,?{x,y,z}}, see Demo2 Match successful
For String,
?{a,b,c,
?{x,y,z}
}
see Demo3 Match Successful
For String,
?{a,b,c,
?{x,y,
?{1,2,3}
}
}
see Demo4 Match Successful
For String ?{1,2,3} ?{a,b,c} ?{1,2,3} ?{a,b,c}, see Demo5 No Match
Explanation:
(?<!^\?{[^{}]*), - negative lookbehind to discard the 1st level commas. The logic applied here is it should not match the comma which is preceded by start of the string followed by ?{ followed by 0+ occurrences of any character except { or }
(?=[\s\S]*(?:\s*}){2,}) - The comma matched above must be followed by at least 2 occurrences of } (consecutive or with only whitespace between them)
Your question is rather unclear @norbre, but I presume you'd like to extract (i.e. "count") the number of commas.
You can't do this with a regex. Regexps can't count the number of occurrences. However, you can use this to extract the "internal part" and then use a spreadsheet formula to count the number of commas:
^(?:\?{[a-zA-Z0-9,]+?,\n??\s*?\?{)([a-zA-Z0-9,?{}\n\s]+?(?:\n*?\s*?|})+)(?:[a-zA-Z0-9,\n\s]*})$
Try: https://regex101.com/r/Rr0eFo/5
Examples
1.
Input:
?{a,b,c,?{e,f},1,2,3}
Output:
e,f}
2.
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Output:
x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
Input:
?{a,b,c,?{e},1,2,3}
Output:
e}
(note that there are no commas here!)
One caveat, however. As I have said, regexps can't count the number of occurrences.
Hence, the following sample (I don't know if it's valid for your case) would return the wrong match:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Output:
e,f}
,1,2,3,?{a,b}
OK, replacing commas is another story, so I'll add another answer.
Your regexp engine would need to support recursion.
Still, I don't see a way to do it with one regex: one match would either contain the first comma or contain everything between the braces!
What I suggest is to use one regexp to get "what is inside the inner braces", run a replace (, => "") on it, and assemble the whole line again using the submatches from the regexp.
Here it is: (\?{[^?{}]*)((?>[^?{}]|(?R))+?)([^?{}]*?\})
Try: https://regex101.com/r/IzTeY0/3
Example 1:
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Submatches:
1. ?{a,b,c,
2. ?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
,d,e,f}
Replace all commas in submatch 2 with anything you want, then reassemble the whole string using submatches 1 and 3.
Again, this would break the regexp:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Submatch 2 would look like this:
?{e,f}
,1,2,3,?{a,b}
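If a single regex is not a hard requirement, a plain depth counter avoids the recursion problem altogether, including the sibling-group input that breaks the patterns above. The following is a minimal sketch in Python, not one of the regex approaches from the answers; the function name, the ";" replacement and the min_depth parameter are assumptions made up for illustration.

```python
def replace_nested_commas(text, replacement=";", min_depth=2):
    """Replace commas that sit inside at least min_depth open '?{' groups."""
    out = []
    depth = 0  # number of '?{' groups currently open
    i = 0
    while i < len(text):
        if text.startswith("?{", i):
            depth += 1
            out.append("?{")
            i += 2
            continue
        ch = text[i]
        if ch == "}":
            depth = max(depth - 1, 0)  # tolerate a stray closing brace
            out.append(ch)
        elif ch == "," and depth >= min_depth:
            out.append(replacement)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(replace_nested_commas("?{1,2,3} ?{a,b,c}"))
print(replace_nested_commas("?{1,2,3,?{a,b,c}}"))
print(replace_nested_commas("?{a,b,c,?{e,f}\n,1,2,3,?{a,b}\n}"))
```

Because the counter goes back down at every }, two inner groups at the same level are handled independently, and the first-level commas between them are left alone.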
Learning regex in bash, I am trying to fetch all lines which end with .com
Initially I did:
cat patternNpara.txt | egrep "^[[:alnum:]]+(.com)$"
Why: + matches one or more occurrences, so placing it after alnum should fetch any run of digits, words or signs, but apparently this logic is failing.
Then I did this (purely hit-and-try, not applying any logic really) and it worked:
cat patternNpara.txt | egrep "^[[:alnum:]].+(.com)$"
What's confusing me: . matches only a single occurrence, so how am I getting the output? I mean, how is it really matching the pattern?
Question: what's the difference between [[:alnum:]]+ and [[:alnum:]].+ (this one has . in it) in the above patterns, and how does it work?
PS: I am looking for an explanation, not a "try it this way" answer. :)
Some test lines from the file patternNpara.txt which are fetched as output:
valid email = abc#abc.com
invalid email = ab#abccom
another invalid = abc#.com
1 : abc,s,11#gmail.com
2: abc.s.11#gmail.com
Looking at your screenshot, it seems you're trying to match email addresses that also contain the # character, which is not included in your regex. You can use this regex:
egrep "[#[:alnum:]]+(\.com)" patternNpara.txt
Difference between the 2 regexes:
[[:alnum:]] matches only [a-zA-Z0-9]. If you have # or , then you need to include them in the character class as well.
Your 2nd case includes the .+ pattern, which means 1 or more matches of ANY CHARACTER.
If you want to match any lines that end with '.com', you should use
egrep ".*\.com$" file.txt
To match all the following lines
valid email = abc#abc.com
invalid email = ab#abccom
another invalid = abc#.com
1 : abc,s,11#gmail.com
2: abc.s.11#gmail.com
^[[:alnum:]].+(.com)$ will work, but ^[[:alnum:]]+(.com)$ will not. Here are the reasons:
^[[:alnum:]].+(.com)$ matches strings that start with one character from a-zA-Z or 0-9, followed by two or more arbitrary characters (one or more from .+ plus one from the unescaped . in the group), and end with 'com' (not '.com').
^[[:alnum:]]+(.com)$ matches strings that start with one or more characters from a-zA-Z or 0-9, followed by one character that could be anything, and end with 'com' (not '.com'). The sample lines contain spaces, = and #, which [[:alnum:]]+ cannot match, so none of them match.
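The difference is easy to see by running the patterns over a few lines. Here is a rough sketch in Python rather than egrep; Python's re module has no POSIX [[:alnum:]] class, so [A-Za-z0-9] is used as a stand-in (an assumption), and the last test line is a hypothetical addition that is not in the question's file.

```python
import re

# Stand-ins for the egrep patterns; [A-Za-z0-9] replaces [[:alnum:]].
PLUS_ONLY = re.compile(r"^[A-Za-z0-9]+(.com)$")  # ^[[:alnum:]]+(.com)$
WITH_DOT = re.compile(r"^[A-Za-z0-9].+(.com)$")  # ^[[:alnum:]].+(.com)$
ESCAPED = re.compile(r"\.com$")                  # a literal ".com" at end of line

lines = [
    "valid email = abc#abc.com",
    "invalid email = ab#abccom",
    "another invalid = abc#.com",
    "gmail.com",  # hypothetical extra line: only alphanumerics before ".com"
]


def matching(pattern):
    """Return the lines in which pattern finds a match."""
    return [line for line in lines if pattern.search(line)]


for name, pattern in (("PLUS_ONLY", PLUS_ONLY), ("WITH_DOT", WITH_DOT), ("ESCAPED", ESCAPED)):
    print(name, matching(pattern))
```

Only the escaped version rejects "ab#abccom", because in the other two the . inside (.com) matches any character.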
Try this (with "positive-lookahead") :
.+(?=\.com)
Demo :
http://regexr.com?38bo0
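A quick way to see what the lookahead version returns is the sketch below (Python; before_dot_com is a helper name made up for the example, and the last input is hypothetical). The lookahead only checks that .com follows, so the match itself stops just before it. Note that the pattern is not anchored to the end of the line.

```python
import re

LOOKAHEAD = re.compile(r".+(?=\.com)")


def before_dot_com(line):
    """Return the text that precedes '.com' in line, or None if there is none."""
    m = LOOKAHEAD.search(line)
    return m.group(0) if m else None


print(before_dot_com("valid email = abc#abc.com"))
print(before_dot_com("invalid email = ab#abccom"))
print(before_dot_com("shop.community"))  # hypothetical input: ".com" is not at the end
```

Adding $ after \.com inside the lookahead would restrict it to lines that really end with .com.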
I am using TextCrawler regexp to align an existing plain text file.
The text inside the file is continuous, without line breaks.
....moredata....
,actor's list:
Amy Brenneman, Aaron Eckhart, Catherine Keener, Natassja Kinski
, Jason Patric, Ben Stiller,
movies released:
Gladiator,Matrix Reloaded,The Shawshank Redemption,Pirates of the Caribbean
- Curse of the Black Pearl,Monsters Inc,
genre:
SciFi,Romance,Drama,Action,Comedy,Advenure,Animated,Western,Horror
....moredata....
I am trying to find the string(s) between a comma and a colon and replace each with itself, but with a new line added before the found pattern.
I tried the following, but it matches the string from the outermost comma to the colon.
[,]{1}.[A-Z].*[:]
Any idea? Where did I go wrong?
Why not use this pattern:
search: (?<=,)[^,:]+(?=:)
replace: \n$0
pattern details:
(?<=,) # lookbehind assertion: only a check that means "preceded by ,"
[^,:]+ # negated char class: all characters except , and :
(?=:) # lookahead assertion: only a check that means "followed by :"
Lookarounds are only tests that can make the pattern fail or succeed, they are not part of the match result.
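For anyone who wants to try the lookaround pattern outside TextCrawler, here is a small sketch in Python. The sample string is a shortened, single-line version of the question's data and the function name is made up for the example; both are assumptions for illustration.

```python
import re

# Same search pattern as above: a label preceded by "," and followed by ":".
LABEL = re.compile(r"(?<=,)[^,:]+(?=:)")


def break_before_labels(text):
    """Insert a newline before every label that sits between a comma and a colon."""
    return LABEL.sub(lambda m: "\n" + m.group(0), text)


sample = (
    ",actor's list:Amy Brenneman, Aaron Eckhart,"
    "movies released:Gladiator,Matrix Reloaded,genre:SciFi,Romance"
)
print(break_before_labels(sample))
```

Since the comma and the colon are only checked by the lookarounds, they stay where they are and only the label itself is moved to a new line.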
The below mentioned pattern works:
Search Pattern : (,?[^:,]+:)
Replacement String : \n\1\n
For example:
Given a file a.txt with contents :
actor's list:A,B,C,movies released:D,E,F,genre:G,H,I
perl -pe "s#(,?[^:,]+:)#\n\1\n#g" a.txt
The above command produces output in the format below:
actor's list:
A,B,C
,movies released:
D,E,F
,genre:
G,H,I
I hope the above output is what you are expecting.
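The same substitution can be sketched in Python for comparison (the function name is made up for the example). In Python's re.sub, a raw replacement string of \n\1\n expands \n to a newline and \1 to the first capture group.

```python
import re

# Same search pattern as in the perl one-liner above.
SECTION = re.compile(r"(,?[^:,]+:)")


def split_sections(text):
    """Put each ',label:' (or leading 'label:') on its own line."""
    return SECTION.sub(r"\n\1\n", text)


line = "actor's list:A,B,C,movies released:D,E,F,genre:G,H,I"
print(split_sections(line))
```

Note that in this sketch the result begins with a newline, because the first label is wrapped in newlines like all the others.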