I am using wildcard characters (? and *) to search for files in Windows in a c++ program with _tfindfirst64 and _tfindnext64. I observed the following code
TCHAR root[1024] = L"C:/testData/?????_?????.jpg";
_tfinddata64_t c_file;
intptr_t hFile = _tfindfirst64(root, &c_file);
do
{
wcout << c_file.name << endl;
} while (_tfindnext64(hFile, &c_file) == 0);
_findclose(hFile);
picks up the following files:
12345_---.jpg
12345_12.jpg
12345_12345.jpg
But I was expected the filter would accept only the last one as it means: "5 arbitrary chars followed by '_' followed by another chars with '.jpg'" as described. What would be the correct way to do it?
As you've found, a ? in a file search doesn't require a character to be present, but matching will fail if a character is present that your search string doesn't account for. For example, foo?.txt will match foo.txt, foo1.txt, fooa.txt, and so on, but will not match foo10.txt or foo_abc.txt.
About the only way I know of to work around this is to re-check the result afterward to assure you eliminate any unwanted matches. In this case, it sounds like just comparing the length of the filename with what you expect will suffice.
Related
So I had this code working for a few months already, lets say I have a table called Categories, which has a string column called name, so I receive a string and I want to know if any category was mentioned (a mention occur when the string contains the substring: #name_of_a_category), the approach I follow for this was something like below:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today suddenly started to receive an exception unmatched close parenthesis, I realized that the categories names can contain special chars so I decided to not consider special chars or spaces anymore (don't want to add restrictions to the user and at the same time don't want to deal with those cases so the policy is just to ignore it).
So the question is there a clean way of removing these special chars (maintaining the #) and matching the string (don't want to modify the data just ignore it while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') creates a copy of content_received with no special chars and _. Using it once reduced overhead if there are a lot of categories
Then, you iterate over the categories list, and each time check if the prep_content_received matches \b (word boundary) + category with all special chars, _ and leading/trailing whitespace stripped from it + \b in a case insensitive way (see the /i flag, no need to .downcase).
So after looking around I found some answers on the platform but nothing with my specific requirements (maybe I missed something, if so please let me know), and this is how I fix it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.
gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings, basically the first gsub, remove any character that is not a letter or a number (any special character) and replace each # with an #, the second gsub is a simpler version of the first one.
You can test the snippet above here
This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm trying to write a regex for the following pattern:
[MyLiteralString][0 or more characters without restriction][at least 1 digit]
I thought this should do it:
(theColumnName)[\s\S]*[\d]+
As it looks for the literal string theColumnName, followed by any number of characters (whitespace or otherwise), and then at least one digit. But this matches more than I want, as you can see here:
https://www.regex101.com/r/HBsst1/1
(EDIT) Second set of more complex data - https://www.regex101.com/r/h7PCv7/1
Using the sample data in that link, I want the regex to identify the two occurrences of theColumnName] VARCHAR(10) and nothing more.
I have 300+ sql scripts which containing create statements for every type of database object: procedures, tables, triggers, indexes, functions -- everything. Because of that, I can't be too strict with my regex.
A stored procedure's file might include text like LEFT(theColumnName, 10) which I want to identify.
A create table statement would be like theColumnName VARCHAR(12).
So it needs to be very flexible as the number(s) isn't always the same. Sometimes it's 10, sometimes it's 12, sometimes it's 51 -- all kinds of different numbers.
Basically, I'm looking for the regex equivalent of this C# code:
//Get file data
string[] lines = File.ReadAllLines(filePath);
//Let's assume the first line contains 'theColumnName'
int theColumnNameIndex = lines[0].IndexOf("theColumnName");
if (theColumnNameIndex >= 0)
{
//Get the text proceeding 'theColumnName'
string temp = lines[0].Remove(0, theColumnNameIndex + "theColumnNameIndex".Length;
//Iterate over our substring
foreach (char c in temp)
{
if (Char.IsDigit(c))
//do a thing
}
}
(theColumnName).*?[\d]+
That'll make it stop capturing after the first number it sees.
The difference between * and *? is about greediness vs. laziness. .*\d for example would match abcd12ad4 in abcd12ad4, whereas .*?\d would have its first match as abcd1. Check out this page for more info.
Btw, if you don't want to match newlines, use a . (period) instead of [\s\S]
I would like to catch words between specific words in Matlab regular expression.
For example, If line = 'aaaa\bbbbb\ccccc....\wwwww.xyz' is given,
I would like to catch only wwwww.xyz.
aaaa ~ wwwww.xyz does not represent specific words and number of character- it means they can be any character excluding backslash and number of character can be more than 1. wwwww.xyz is always after last backslash. My problem is regexp(line,'\\.+\.xyz','match') does not always work since wwwww sometimes contain special character such as '-'.
Any suggestion is appreciated.
If you Must use regex for this, this regex should work:
[\\]?(?!.+\\)([^.]+\.[a-z]{3})
Working regex example:
http://regex101.com/r/fL5oS5
Example data:
aaaa\bbbbb\ccccc\ww%20-www.xyz
www-654_33.xyz
Matches:
1. ww%20-www.xyz
2. www-654_33.xyz
No solution provided here is likely to be 100% reliable unless you know that your data is carefully formatted (has the path string been escaped?). The question boils down to finding a word that is a valid path in line of text. It not so easy. We'll assume that all files have file extensions (this is not necessarily true in the context of paths). An arbitrary path can then might look like any of the following:
'wwwww.x'
'wwwww.xyz'
'\wwwww.xyz'
'ccccc\wwwww.xyz'
'\ccccc\wwwww.xyz'
...
str = 'The quick brown fox aaaa\bbbbb\ccccc\wwwww.xyz jumped over the lazy dog.';
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.\w+)\s','tokens');
file_name = matches{1}(2)
which returns (for all of the cases above the extension is slightly different for the first case though)
file_name =
'wwwww.xyz'
If you know the filename extension is '.xyz', then you can use this instead:
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.xyz)\s','tokens');
By the way, for a path, the fileparts function can be used:
str = 'aaaa\bbbbb\ccccc\wwwww.xyz'; % A Windows-only path
% str = 'aaaa/bbbbb/ccccc/wwwww.xyz'; % A UNiX or OS X path (works on Windows too)
[path_str,file_name,file_ext] = fileparts(str)
which returns
path_str =
aaaa\bbbbb\ccccc
file_name =
wwwww
file_ext =
.xyz
You can then get the filename with extension via
file_name_ext = [file_name file_ext];
Note also that that path_str omits the trailing file separator.
Assuming that the only thing that your strings have in common is that there is a file path separator, and you are interested in everything "from the last file path separator to the first whitespace", then you could try
['[\' filesep ']([^\' filesep ']+?)(?:\s|$)']
which on Windows platform would reduce to
\\([^\\]+?)(?:\s|$)
Demo:
http://regex101.com/r/jW5tT1
If you want to match the extension literally (.xyz in your example), change it to
\\([^\\]+?\.xyz)(?:\s|$)
"Find a backslash followed by the fewest (+?) number of "not backslash" until literal .xyz followed by a white space or end of string"
I need to store a string replacing its spaces with some character. When I retrieve it back I need to replace the character with spaces again. I have thought of this strategy while storing I will replace (space with _a) and (_a with _aa) and while retrieving will replace (_a with space) and (_aa with _a). i.e even if the user enters _a in the string it will be handled. But I dont think this is a good strategy. Please let me know if anyone has a better one?
Replacing spaces with something is a problem when something is already in the string. Why don't you simply encode the string - there are many ways to do that, one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode back (hex to ascii) the string.
This way there are no space in the encoded string.
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor (space) are a problem anymore - Decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize more since you need to encode only two characters: use %1 for % and %2 for , but I recommend the %xx solution as it is more portable - and may be utilized later on if you need to code more characters)
I'm not sure your solution will work. When reading, how would you
distinguish between strings that were orginally " a" and strings that
were originally "_a": if I understand correctly, both will end up
"_aa".
In general, given a situation were a specific set of characters cannot
appear as such, but must be encoded, the solution is to choose one of
allowed characters as an "escape" character, remove it from the set of
allowed characters, and encode all of the forbidden characters
(including the escape character) as a two (or more) character sequence
starting with the escape character. In C++, for example, a new line is
not allowed in a string or character literal. The escape character is
\; because of that, it must be encoded as an escape sequence as well.
So we have "\n" for a new line (the choice of n is arbitrary), and
"\\" for a \. (The choice of \ for the second character is also
arbitrary, but it is fairly usual to use the escape character, escaped,
to represent itself.) In your case, if you want to use _ as the
escape character, and "_a" to represent a space, the logical choice
would be "__" to represent a _ (but I'd suggest something a little
more visually suggestive—maybe ^ as the escape, with "^_" for
a space and "^^" for a ^). When reading, anytime you see the escape
character, the following character must be mapped (and if it isn't one
of the predefined mappings, the input text is in error). This is simple
to implement, and very reliable; about the only disadvantage is that in
an extreme case, it can double the size of your string.
You want to implement this using C/C++? I think you should split your string into multiple part, separated by space.
If your string is like this : "a__b" (multiple space continuous), it will be splited into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!
With a normal string, using X characters, you cannot write or encode a string with x-1 using only 1 character/input character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the appearances of a space combined with its length, make a new character array and replace these spaces with "//" this is just an example though. The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^" to replace the spaces.
The last approach, popular in a combination of these two approaches. It is used in unix, and php to have syntax character as a literal in a string. If you want to have a " " ", you simply write it as \" etc.
Why don't you use Replace function
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");
I think just coding to ascii hexadecimal is a neat idea, but of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and have to be careful that you can go back easily.
You could e.g. replace blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahed of 1 and see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea with replacing _a breaks. as you do not know if _aa was originally _a or a (blank followed by a).
I'm guessing that there is more to this question than appears; for example, that you the strings you are storing must not only be free of spaces, but they must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things.)
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {
if (ch == ' ') {
if (prev != ' ') out.push_back('_');
} else {
if (prev == '_' && ch != '_') out.push_back('_');
out.push_back(ch);
}
prev = ch;
}
if (prev == '_') out.push_back('_');
I am wanting to create a regular expression for the following scenario:
If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.
So if there was for instance, %25 it would be rejected. For instance, the following string would be valid:
http://www.test.com/?&Name=My%20Name%20Is%20Vader
But these would fail:
http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25
%%%25
Any help would be greatly appreciated,
Kyle
EDIT:
The scenario in a nutshell is that a link is written to an encoded state and then launched via JavaScript. No decoding works. I tried .net decoding and JS decoding, each having the same result - The results stay encoded when executed.
Doesn't require a %:
/^[^%]*(%20[^%]*)*$/
Which language are you using?
Most languages have a Uri Encoder / Decoder function or class.
I would suggest you decode the string first and than check for valid (or invalid) characters.
i.e. something like /[\w ]/ (empty is a space)
With a regex in the first place you need to respect that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".
Another solution if look-arounds are not available:
^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
Reject the string if it matches %[^2][^0]
I think that would find what you need
/^([^%]|%%|%20)+$/
Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:
In case you need to use it in editor (which would explain why you can't use more programmatic way), then you have to correctly escape all special characters, for example in Vim that regex should lool:
/^\([^%]\|%%\|%20\)\+$/
Maybe a better approach is to deal with that validation after you decode that string:
string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
/^([^%]|%20)*$/
This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.
As others have said before, %% is valid too... and %%25would be %25
The below regex matches anything that doesn't fit into the above rules
/(?<![^%]%)%(?!(20|%))/
The first brackets check whether there is a % before the character (meaning that it's %%) and also checks that it's not %%%. it then checks for a %, and checks whether the item after doesn't match 20
This means that if anything is identified by the regex, then you should probably reject it.
I agree with dominic's comment on the question. Don't use Regex.
If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is being followed by 20 and nothing else. (Update: allow a % after to be interpreted as a literal %nnn sequence)
// pseudo code
pos = 0
while (pos = mystring.find(pos, '%'))
{
if mystring[pos+1] = "%" then
pos = pos + 2 // ok, this is a literal, skip ahead
else if mystring.substring(pos,2) != "20"
return false; // string is invalid
end if
}
return true;