Regex to catch all files but those starting with "." - regex

In a directory with mixed content such as:
.afile
.anotherfile
bfile.file
bnotherfile.file
.afolder/
.anotherfolder/
bfolder/
bnotherfolder/
How would you catch everything but the files (not dirs) starting with .?
I have tried with a negative lookahead ^(?!\.).+? but it doesn't seem to work right.
Please note that I would like to avoid doing it by excluding the . by using [a-zA-Z< plus all other possible chars minus the dot >]
Any suggestions?

This should do it:
^[^.].*$
[^abc] will match anything that is not a, b or c

Escape the . and negate it in a character class for the first character of the name (inside a character class the backslash is optional):
^[^\.].*$
Tested successfully with your test cases here.

The negative lookahead ^(?!\.).+$ does work. Here it is in Java:
String[] files = {
    ".afile",
    ".anotherfile",
    "bfile.file",
    "bnotherfile.file",
    ".afolder/",
    ".anotherfolder/",
    "bfolder/",
    "bnotherfolder/",
    "",
};
for (String file : files) {
    // columns: name, regex result, startsWith result
    System.out.printf("%-18s %6b%6b%n", file,
        file.matches("^(?!\\.).+$"),
        !file.startsWith(".")
    );
}
The output is (as seen on ideone.com):
.afile              false false
.anotherfile        false false
bfile.file           true  true
bnotherfile.file     true  true
.afolder/           false false
.anotherfolder/     false false
bfolder/             true  true
bnotherfolder/       true  true
                    false  true
Note also the use of the non-regex String.startsWith. Arguably this is the best, most readable solution, because regex is not needed anyway, and startsWith is O(1) whereas the regex (at least in Java) is O(N).
Note the disagreement on the blank string. If this is a possible input, and you want this to return false, you can write something like this:
!file.isEmpty() && !file.startsWith(".")
See also
Is regex too slow? Real life examples where simple non-regex alternative is better
In Java, .* even in Pattern.DOTALL mode takes O(N) to match.
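For comparison, here is a minimal Python sketch of the same check. It assumes re.fullmatch as the counterpart of Java's String.matches; the function names visible_by_regex and visible_by_prefix are made up for illustration. It shows the same disagreement on the empty string.

```python
import re

# Same pattern as the Java example; fullmatch requires the whole name to match,
# which is what String.matches does in Java.
NOT_DOTTED = re.compile(r"(?!\.).+")

def visible_by_regex(name: str) -> bool:
    """True if the name is non-empty and does not start with a dot."""
    return NOT_DOTTED.fullmatch(name) is not None

def visible_by_prefix(name: str) -> bool:
    """True if the name does not start with a dot (the empty string passes)."""
    return not name.startswith(".")

for name in [".afile", "bfile.file", ".afolder/", "bfolder/", ""]:
    print(repr(name), visible_by_regex(name), visible_by_prefix(name))
```

If the empty string must be rejected by the prefix check as well, `bool(name) and not name.startswith(".")` mirrors the Java fix above.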

Uhm... how about a negative character class?
[^.]
to exclude the dot?

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} exactly three repetitions of the preceding element
(...)* zero or more repetitions of the group in the parentheses
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
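To make the two approaches easy to compare, here is a small Python 3 sketch. The names valid_by_regex and valid_without_regex are hypothetical, and re.fullmatch is used instead of explicit ^...$ anchors; treat it as one possible way to write the check, not the only one.

```python
import re

# A triple, followed by zero or more ",triple" groups; fullmatch anchors both ends.
TRIPLES = re.compile(r"[abc]{3}(?:,[abc]{3})*")

def valid_by_regex(data: str) -> bool:
    return TRIPLES.fullmatch(data) is not None

def valid_without_regex(data: str) -> bool:
    # Every comma-separated item must be exactly three letters from "abc".
    return all(len(item) == 3 and set(item) <= set("abc")
               for item in data.split(","))

for sample in ["abc,bca,cbb", "bcb", "abc,defx,df", "abc,", "abcabc"]:
    print(sample, valid_by_regex(sample), valid_without_regex(sample))
```

Both functions reject a trailing comma and an empty string, since splitting then yields an empty item.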
You need to iterate over the sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# anchor at the end too, otherwise the trailing "df" is silently ignored
match = re.match(r'^([abc]{3},)*[abc]{3}$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, at the start of the string, and it does not care what follows. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
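\Z means the end of the string. Its presence forces the match to extend to the very end of the string. Note that because the comma is optional, this pattern also accepts inputs such as abcabc or abc, (with a trailing comma).
By the way, I like Sonya's form too; it is clearer and does not have that problem:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z',ss))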
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that the pattern has match groups (to actually get the groups, use str.extract), and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
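The re.findall effect mentioned above is easy to reproduce. A minimal sketch (the sample text is made up for illustration):

```python
import re

text = "abcabc xx abc"

# With a capturing group, findall returns what the group captured last,
# not the whole match.
with_capture = re.findall(r"(abc)+", text)

# With a non-capturing group, findall returns the whole matches.
without_capture = re.findall(r"(?:abc)+", text)

print(with_capture)     # ['abc', 'abc']
print(without_capture)  # ['abcabc', 'abc']
```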

Regex Find English char in text need more than 3

I want to validate that a text has more than 3 [aA-zZ] characters; they do not need to be contiguous.
/^(?![_\-\s0-9])(?!.*?[_\-\s]$)(?=.*[aA-zZ]{3,})[_\-\sa-zA-Z0-9]+$/.test("aaa123") => return true;
/^(?![_\-\s0-9])(?!.*?[_\-\s]$)(?=.*[aA-zZ]{3,})[_\-\sa-zA-Z0-9]+$/.test("a1b2c3") => return false;
Can anybody help me?
How about replacing and counting?
var hasFourPlusChars = function(str) {
    // strip everything that is not an ASCII letter, then count what is left
    return str.replace(/[^a-zA-Z]+/g, '').length > 3;
};
console.log(hasFourPlusChars('testing1234'));
console.log(hasFourPlusChars('a1b2c3d4e5'));
You need to group .* and [a-zA-Z] in order to allow optional arbitrary characters between English letters:
^(?![_\-\s0-9])(?!.*?[_\-\s]$)(?=(?:.*[a-zA-Z]){3,})[_\-\sa-zA-Z0-9]+$
                                 ^^^          ^
                                 Add this
Demo:
var re = /^(?![_\-\s0-9])(?!.*?[_\-\s]$)(?=(?:.*[a-zA-Z]){3,})[_\-\sa-zA-Z0-9]+$/;
console.log(re.test("aaa123"));
console.log(re.test("a1b2c3"));
By the way, [aA-zZ] is not a correct range definition. Use [a-zA-Z] instead. See here for more details.
Correction of the regex
Your repeat condition should include the ".*", so the quantifier has to apply to a group inside the lookahead rather than to the lookahead itself. I did not check if your regex is correct for what you want to achieve, but this correction works for the following strings:
$testStrings=["aaa123","a1b2c3","a1b23d"];
foreach($testStrings as $s)
    var_dump(preg_match('/^(?![_\-\s0-9])(?!.*?[_\-\s]$)(?=(?:.*[a-zA-Z]){3,})[_\-\sa-zA-Z0-9]+$/', $s));
Other implementations
As the language seems to be JavaScript, here is an optimised implementation for what you want to achieve:
"a24be4Z".match(/[a-zA-Z]/g).length>=3
We get the list of all matches and check if there are at least 3.
That is not the fastest way, as the array of matches needs to be created (and match returns null when there is no letter at all, so reading .length would throw).
/(?:.*?[a-zA-Z]){3}/.test("a24be4Z")
is faster. The lazy ".*?" keeps "test" from first consuming every character up to the end of the string and then backtracking to try other combinations.
As expected, the first suggestion (counting the number of matches) is the slowest.
Check https://jsperf.com/check-if-there-are-3-ascii-characters .
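The same two ideas carry over to other regex engines. Here is a rough Python sketch of both, assuming "at least three ASCII letters" is the requirement; the names has_three_letters and count_letters are hypothetical.

```python
import re

# Three times: lazily skip anything, then one ASCII letter.
THREE_LETTERS = re.compile(r"(?:.*?[a-zA-Z]){3}")

def has_three_letters(text: str) -> bool:
    """True if the text contains at least three ASCII letters, contiguous or not."""
    return THREE_LETTERS.search(text) is not None

def count_letters(text: str) -> int:
    """Counting variant: builds the full list of letters first."""
    return len(re.findall(r"[a-zA-Z]", text))

for sample in ["aaa123", "a1b2c3", "a1b2", "1234"]:
    print(sample, has_three_letters(sample), count_letters(sample) >= 3)
```

Unlike the JavaScript match call, re.findall returns an empty list when nothing matches, so the counting variant needs no null check here.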

Find repeated words in a string separated by "/"

Assume the following vector:
x <- c("/default/img/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/IRS.html/", "something/repeat/repeat_this")
I want to check whether a word enclosed by / is repeated (Note that / might be missing from start and end of string). I found the following brilliant piece of regex here but (after I strip special characters) I can't seem to modify it to fit my case:
grepl("\\b(\\S+?)\\1\\S*\\b", x, perl = TRUE)
# [1] TRUE TRUE
I can always str_split(x, "/") and iterate the duplicated() function over the list and use an if() statement but that would be terribly inefficient.
Desired outcome should be a vector with TRUE or FALSE (or 1 and 0).
Other solution if you only want to check your pattern
grepl(x, pattern = "((.+)/).*(/\\2(/|$))", perl=T)
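where (.+) represents the word itself (capture group 2) appearing before a slash, and the .* allows arbitrary characters between the two occurrences. (/\\2(/|$)) then matches if the word occurs again after a slash and is followed by either another slash or the end of the string ($). Note that (.+) is not anchored to the start of a word, so a segment that merely ends with the word (e.g. xirs/y/irs) also counts, and the first occurrence consumes its trailing slash, so a word repeated exactly twice in a row (a/irs/irs) is not detected.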
For extraction you can use strsplit() as elaborated above.
I think the following could work for you. First, fixed = TRUE in strsplit() bypasses the regex engine and goes straight to exact matching, making the function much faster. Next, anyDuplicated() returns a length one integer result which will be zero if no duplicates are found, and greater than zero otherwise. So we can split the string with strsplit() and iterate anyDuplicated() over the result. Then we can compare the resulting vector with zero.
vapply(strsplit(x, "/", fixed = TRUE), anyDuplicated, 1L) > 0L
# [1] TRUE FALSE
To be safe, you may want to remove any leading /, since it will produce an empty character in the result from strsplit() and could produce misleading results in some cases (e.g. cases where the string begins with a / and irs//irs or similar occurs later in the string). You can remove leading forward slashes with sub("^/", "", x).
In summary, the ways to make your strsplit() idea faster are:
use fixed = TRUE in strsplit() to bypass the regex engine
use anyDuplicated() since it stops looking after it finds one match
use vapply() since we know what the result type and length will be
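For readers outside R, the split-and-look-for-duplicates idea can be sketched in Python as follows. The function name has_repeated_segment is made up, and dropping empty segments (from leading, trailing or doubled slashes) is an assumption on my part, in the spirit of the sub("^/", "", x) advice above.

```python
def has_repeated_segment(path: str) -> bool:
    """True if any non-empty "/"-separated segment occurs more than once."""
    seen = set()
    for segment in path.split("/"):
        if not segment:
            # Assumption: empty pieces from leading/trailing/double slashes are ignored.
            continue
        if segment in seen:
            return True  # stop at the first duplicate, like anyDuplicated()
        seen.add(segment)
    return False

x = [
    "/default/img/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/IRS.html/",
    "something/repeat/repeat_this",
]
print([has_repeated_segment(p) for p in x])  # [True, False]
```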

disallow repetition of given set of characters

I need a regex that rejects any string in which two characters from the following set are next to each other:
". / - ( )"
For example:
123()123 - false
123--123 - false
124((123 - false
123(123)123-12-12 - true
This is what I have done so far:
(?:([\/().-])(?!.*\1))
You can use :
(^(?:(?![.\/()-]{2}).)*$)
Explanation :
^((?![\/().-]{2}).)*$
This simply negates the regex [\/().-]{2} which matches if two of your characters are next to each other.
See this answer for further explanation.
Maybe it is easier to do it the other way around: match the strings you don't want to allow.
if match [.\/()-]{2}
not allowed
else
allowed
end
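The pseudocode above could look like this in Python. This is only a sketch: the name is_allowed is hypothetical, and the character set is the one from the question (. / - ( )).

```python
import re

# Two characters from the set . / ( ) - directly next to each other.
ADJACENT_SPECIALS = re.compile(r"[./()\-]{2}")

def is_allowed(text: str) -> bool:
    """Allowed means: no two special characters are adjacent anywhere."""
    return ADJACENT_SPECIALS.search(text) is None

for sample in ["123()123", "123--123", "124((123", "123(123)123-12-12"]:
    print(sample, is_allowed(sample))
```

Searching for the forbidden pair and negating the result avoids the tempered lookahead entirely, which tends to be easier to read.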

RegEx to match any character except strings with surrounding prefixes

I'm trying to create a regex which will validate input to match any character but it should exclude input which is surrounded with prefixes (i.e. {#..#}, {#..#} or {$..$}).
Given is the example:
Free text which is fine // should return true
{#some other text#} // should return false
Text with numbers 671 // should return true
{#Hello world#} // should return false
{$Hello Mars$} // should return false
{$some text which i do not close // should return true
This should be possible with the use of negative look-around, something along the lines of:
^(\?!=(\{#[^.]+#\}))(\?!=(\{$[^.]+$\}))(\?!=(\{#[^.]+#\})).*
Any help would be greatly appreciated : )
Your syntax is a bit weird if you ask me... I would suggest:
^(?!\{(?:[$##]).*(?:[$##])\}).*
First 'group' is \{(?:[$##]) which looks for the opening prefix, then .* to match everything in the middle and (?:[$##])\} to match the closing suffix.
Note that it will not allow things like:
{$Hello Mars$} how are you?
If you want it to accept this as well, add an end of line anchor:
^(?!\{(?:[$##]).*(?:[$##])\}$).*
                            ^
Using a character class for the different symbols, [#$#], is shorter than having multiple negative lookarounds or | operators inside :)
EDIT: To avoid rejecting things with mismatched delimiters, like {#Free text which is fine$}, you could use:
^(?!\{([$##]).*\1\}).*
Or
^(?!\{([$##]).*\1\}$).*
For the second version.
\1 is a backreference and refers to the first captured group (any of $, #, or #).
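Here is a quick Python sketch of the second (anchored) backreference version, run against the examples from the question. The name is_free_text is hypothetical, and the delimiter class is reduced to [$#] for illustration, so extend it if you have more prefix characters.

```python
import re

# Reject only input that is entirely wrapped as {X ... X} with the same X
# (assumed here to be $ or #) on both sides.
NOT_WRAPPED = re.compile(r"^(?!\{([$#]).*\1\}$).*")

def is_free_text(text: str) -> bool:
    return NOT_WRAPPED.match(text) is not None

samples = [
    "Free text which is fine",
    "{#some other text#}",
    "{$Hello Mars$}",
    "{$some text which i do not close",
    "{#Free text which is fine$}",
    "{$Hello Mars$} how are you?",
]
for sample in samples:
    print(sample, is_free_text(sample))
```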