Glib regex for matching whole word? - regex

For matching a whole word, the regex \bword\b should suffice. Yet the following code always returns 0 matches
try {
string pattern = "\bhtml\b";
Regex wordRegex = new Regex (pattern, RegexCompileFlags.CASELESS, RegexMatchFlags.NOTEMPTY);
MatchInfo matchInfo;
string lineOfText = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
wordRegex.match (lineOfText, RegexMatchFlags.NOTEMPTY, out matchInfo);
stdout.printf ("Match count is: %d\n", matchInfo.get_match_count ());
} catch (RegexError regexError) {
stderr.printf ("Regex error: %s\n", regexError.message);
}
This should work: testing the \bhtml\b pattern against the provided string in online regex testers returns one match, but this program reports 0 matches. Is the code wrong? What regex would match a whole word in GLib?

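You have to escape the backslash too. In an ordinary string literal, \b is the escape sequence for the backspace character, so the regex engine never sees the word-boundary assertion: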
try {
string pattern = "\\bhtml\\b";
Regex wordRegex = new Regex (pattern, RegexCompileFlags.CASELESS, RegexMatchFlags.NOTEMPTY);
MatchInfo matchInfo;
string lineOfText = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
wordRegex.match (lineOfText, RegexMatchFlags.NOTEMPTY, out matchInfo);
stdout.printf ("Match count is: %d\n", matchInfo.get_match_count ());
} catch (RegexError regexError) {
stderr.printf ("Regex error: %s\n", regexError.message);
}
Output:
Match count is: 1
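The same pitfall can be reproduced outside Vala. Here is a minimal sketch in Python, used only for illustration (its re module is not GLib's regex engine, and count_matches is a made-up helper): "\b" in an ordinary string literal is a backspace character, while "\\b" or a raw string passes the word-boundary assertion through to the regex engine.

```python
import re

line = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')

unescaped = "\bhtml\b"   # "\b" is a backspace character (0x08) in the string itself
escaped = "\\bhtml\\b"   # the regex engine receives \bhtml\b (word boundaries)


def count_matches(pattern, text):
    """Hypothetical helper: number of case-insensitive matches of pattern in text."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


if __name__ == "__main__":
    print("unescaped:", count_matches(unescaped, line))
    print("escaped:  ", count_matches(escaped, line))
    print("raw:      ", count_matches(r"\bhtml\b", line))
```

Only the standalone word html is counted with the escaped pattern; XHTML and xhtml1 are skipped because there is no word boundary before the h.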

You can simplify your code with regular expression literals:
Regex regex = /\bhtml\b/i;
You don't have to escape backslashes in the regular expression literal syntax. (Forward slashes would be problematic, though, because they delimit the literal.)
Full example:
void test_match (string text, Regex regex) {
MatchInfo match_info;
if (regex.match (text, RegexMatchFlags.NOTEMPTY, out match_info)) {
stdout.printf ("Match count is: %d\n", match_info.get_match_count ());
}
else {
stdout.printf ("No match\n");
}
}
int main () {
Regex regex = /\bhtml\b/i;
test_match ("<!DOCTYPE html PUBLIC>", regex);
return 0;
}

Related

Regular Expression Matcher

I am using pattern matching to check a file's extension against a regular expression. The code is as follows:
public static enum FileExtensionPattern
{
WORDDOC_PATTERN( "([^\\s]+(\\.(?i)(txt|docx|doc))$)" ), PDF_PATTERN(
"([^\\s]+(\\.(?i)(pdf))$)" );
private String pattern = null;
FileExtensionPattern( String pattern )
{
this.pattern = pattern;
}
public String getPattern()
{
return pattern;
}
}
pattern = Pattern.compile( FileExtensionPattern.WORDDOC_PATTERN.getPattern() );
matcher = pattern.matcher( fileName );
if ( matcher.matches() )
icon = "blue-document-word.png";
When the file name is "Home & Artifact.docx", matcher.matches() still returns false. It works fine for file names with the ".doc" extension.
Can you please point out what I am doing wrong?
"Home & Artifact.docx" contains spaces. Since you allow any char except whitespaces [^\s]+, this filename is not matched.
Try this instead:
(.+?(\.(?i)(txt|docx|doc))$)
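To make the difference concrete, here is a small sketch in Python rather than Java (an illustration only; STRICT, RELAXED and is_word_doc are made-up names). It uses fullmatch, which requires the whole file name to match in the same way Matcher.matches() does, and compares the non-whitespace class with a version that allows any characters before the extension.

```python
import re

# Original idea: one or more NON-whitespace characters, then the extension.
STRICT = re.compile(r"[^\s]+\.(txt|docx|doc)", re.IGNORECASE)
# Relaxed idea: any characters (spaces included) before the extension.
RELAXED = re.compile(r".+?\.(txt|docx|doc)", re.IGNORECASE)


def is_word_doc(pattern, file_name):
    """Hypothetical helper: True if the whole file name matches the pattern."""
    return pattern.fullmatch(file_name) is not None


if __name__ == "__main__":
    for name in ["report.doc", "Home & Artifact.docx", "notes.pdf"]:
        print(name, is_word_doc(STRICT, name), is_word_doc(RELAXED, name))
```

The strict pattern rejects "Home & Artifact.docx" because the character class cannot consume the spaces, while the relaxed one accepts it.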
It is because you have spaces in filename ("Home & Artifact.docx") but your regex has [^\\s]+ which won't allow any spaces.
Use this regex instead for WORDDOC_PATTERN:
"(?i)^.+?\\.(txt|docx|doc)$"

What is the regular expression for matching a string that starts with "<tag1 attribute="0">" and ends with "</tag>"?

What is the regular expression for matching a string that starts with <tag1 attribute="0"> and ends with </tag>?
Regex: <tag1 attribute="0">.*?<\/tag>
Test code:
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
string input = @"this is <tag1 attribute=""0""> and ends with </tag>, okay?";
Console.WriteLine(input);
Match match = Regex.Match(input, @"<tag1 attribute=""0"">.*?<\/tag>",
RegexOptions.IgnoreCase);
if (match.Success)
{
Console.WriteLine("Match!");
}
}
}
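The .*? in that pattern is a lazy quantifier: it stops at the first closing tag instead of the last one. A rough sketch of the difference, in Python for illustration only (LAZY, GREEDY and first_block are made-up names, and Python's re module is not the .NET engine):

```python
import re

LAZY = re.compile(r'<tag1 attribute="0">.*?</tag>', re.IGNORECASE)
GREEDY = re.compile(r'<tag1 attribute="0">.*</tag>', re.IGNORECASE)


def first_block(pattern, text):
    """Hypothetical helper: the first substring matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = 'x <tag1 attribute="0">one</tag> y <tag1 attribute="0">two</tag> z'
    print(first_block(LAZY, sample))    # stops at the first </tag>
    print(first_block(GREEDY, sample))  # runs on to the last </tag>
```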

GWT Regex and empty string

Could someone explain why this snippet:
// import com.google.gwt.regexp.shared.MatchResult;
// import com.google.gwt.regexp.shared.RegExp;
RegExp regExp = RegExp.compile("^$");
MatchResult matcher;
while ((matcher = regExp.exec("")) != null)
{
System.out.println("match " + matcher);
}
gives an enormous number of matches? I tested the different modifiers that GWT's implementation of compile() allows (g, i and m). It works only with m (multiline).
I just want to check for empty string.
[EDIT] the new method
private ArrayList<MatchResult> getMatches(String input, String pattern)
{
ArrayList<MatchResult> matches = new ArrayList<MatchResult>();
if(null == regExp)
{
regExp = RegExp.compile(pattern, "g");
}
if(input.isEmpty())
{
// Empty string: just check whether the pattern validates and
// don't try to extract matches, as that would result in an
// infinite loop.
if(regExp.test(input))
{
matches.add(new MatchResult(0, "", new ArrayList<String>(0)));
}
}
else
{
for(MatchResult matcher = regExp.exec(input); matcher != null; matcher = regExp
.exec(input))
{
matches.add(matcher);
}
}
return matches;
}
Your regExp.exec("") with RegExp.compile("^$") will never return null, because the empty string "" matches the regex ^$, which reads "nothing between the beginning and the end of the line/string".
So your while loop is an infinite loop.
Also, your print statement is
System.out.println("match " + matcher);
...but you probably wanted to use
System.out.println("match " + matcher.getGroup(0));
Also see GWT checking if textbox is empty.
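One common way to keep an exec-style loop from spinning on a zero-length match is to advance the search position by hand whenever the match is empty. The following is only a sketch of that idea in Python (all_matches is a made-up helper, and Python's re module is not GWT's RegExp):

```python
import re


def all_matches(pattern, text):
    """Hypothetical helper: collect (start, matched_text) pairs, guarding empty matches."""
    regex = re.compile(pattern)
    found = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        found.append((match.start(), match.group(0)))
        # A zero-length match ends where it started. Step past it by hand,
        # otherwise the next search would return the very same match again.
        if match.end() == match.start():
            pos = match.end() + 1
        else:
            pos = match.end()
    return found


if __name__ == "__main__":
    print(all_matches(r"^$", ""))        # one empty match, then the loop stops
    print(all_matches(r"\d+", "a1b22"))
```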

How do I check if a filename matches a wildcard pattern

I've got a wildcard pattern, perhaps "*.txt" or "POS??.dat".
I also have list of filenames in memory that I need to compare to that pattern.
How would I do that, keeping in mind I need exactly the same semantics that IO.DirectoryInfo.GetFiles(pattern) uses.
EDIT: Blindly translating this into a regex will NOT work.
I have a complete answer in code for you that's 95% like FindFiles(string).
The 5% that isn't there is the short names/long names behavior in the second note on the MSDN documentation for this function.
If you still want that behavior, you'll have to compute the short name of each string in the input array, and then add the long name to the collection of matches if either the long or the short name matches the pattern.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace FindFilesRegEx
{
class Program
{
static void Main(string[] args)
{
string[] names = { "hello.t", "HelLo.tx", "HeLLo.txt", "HeLLo.txtsjfhs", "HeLLo.tx.sdj", "hAlLo20984.txt" };
string[] matches;
matches = FindFilesEmulator("hello.tx", names);
matches = FindFilesEmulator("H*o*.???", names);
matches = FindFilesEmulator("hello.txt", names);
matches = FindFilesEmulator("lskfjd30", names);
}
public static string[] FindFilesEmulator(string pattern, string[] names)
{
List<string> matches = new List<string>();
Regex regex = FindFilesPatternToRegex.Convert(pattern);
foreach (string s in names)
{
if (regex.IsMatch(s))
{
matches.Add(s);
}
}
return matches.ToArray();
}
internal static class FindFilesPatternToRegex
{
private static Regex HasQuestionMarkRegEx = new Regex(@"\?", RegexOptions.Compiled);
private static Regex IllegalCharactersRegex = new Regex("[" + @"\/:<>|" + "\"]", RegexOptions.Compiled);
private static Regex CatchExtentionRegex = new Regex(@"^\s*.+\.([^\.]+)\s*$", RegexOptions.Compiled);
private static string NonDotCharacters = @"[^.]*";
public static Regex Convert(string pattern)
{
if (pattern == null)
{
throw new ArgumentNullException();
}
pattern = pattern.Trim();
if (pattern.Length == 0)
{
throw new ArgumentException("Pattern is empty.");
}
if(IllegalCharactersRegex.IsMatch(pattern))
{
throw new ArgumentException("Pattern contains illegal characters.");
}
bool hasExtension = CatchExtentionRegex.IsMatch(pattern);
bool matchExact = false;
if (HasQuestionMarkRegEx.IsMatch(pattern))
{
matchExact = true;
}
else if(hasExtension)
{
matchExact = CatchExtentionRegex.Match(pattern).Groups[1].Length != 3;
}
string regexString = Regex.Escape(pattern);
regexString = "^" + Regex.Replace(regexString, @"\\\*", ".*");
regexString = Regex.Replace(regexString, @"\\\?", ".");
if(!matchExact && hasExtension)
{
regexString += NonDotCharacters;
}
regexString += "$";
Regex regex = new Regex(regexString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
return regex;
}
}
}
}
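The rule this code encodes is easy to miss: when the pattern contains no ? and its extension is exactly three characters long, extensions that merely start with those three characters also match. Here is a condensed sketch of the same decision logic in Python, for illustration only (find_files_regex and emulate are made-up names, the illegal-character check is left out, and it has not been checked against GetFiles itself):

```python
import re

# Captures the text after the last dot of the pattern, if there is one.
EXTENSION = re.compile(r"^\s*.+\.([^.]+)\s*$")


def find_files_regex(pattern):
    """Hypothetical helper: translate a wildcard pattern, extension quirk included."""
    pattern = pattern.strip()
    ext = EXTENSION.match(pattern)
    has_extension = ext is not None
    if "?" in pattern:
        match_exact = True
    elif has_extension:
        match_exact = len(ext.group(1)) != 3
    else:
        match_exact = False
    body = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    if not match_exact and has_extension:
        body += "[^.]*"  # allow extra non-dot characters after a 3-character extension
    return re.compile("^" + body + "$", re.IGNORECASE)


def emulate(pattern, names):
    regex = find_files_regex(pattern)
    return [name for name in names if regex.match(name)]


if __name__ == "__main__":
    names = ["hello.t", "HelLo.tx", "HeLLo.txt", "HeLLo.txtsjfhs",
             "HeLLo.tx.sdj", "hAlLo20984.txt"]
    for pattern in ["hello.tx", "H*o*.???", "hello.txt", "lskfjd30"]:
        print(pattern, emulate(pattern, names))
```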
You can simply do this. You do not need regular expressions.
using System;
using Microsoft.VisualBasic; // CompareMethod
using Microsoft.VisualBasic.CompilerServices; // Operators
if (Operators.LikeString("pos123.txt", "pos?23.*", CompareMethod.Text))
{
Console.WriteLine("Filename matches pattern");
}
Or, in VB.Net,
If "pos123.txt" Like "pos?23.*" Then
Console.WriteLine("Filename matches pattern")
End If
In C# you could simulate this with an extension method. It wouldn't be exactly like VB Like, but it would be like...very cool.
You could translate the wildcards into a regular expression:
*.txt -> ^.*\.txt$
POS??.dat -> ^POS..\.dat$
Use the Regex.Escape method to escape the characters that are not wildcards into literal strings for the pattern (e.g. converting ".txt" to "\.txt").
The wildcard * translates into .* (zero or more characters), and ? translates into . (exactly one character).
Put ^ at the beginning of the pattern to match the beginning of the string, and $ at the end to match the end of the string.
Now you can use the Regex.IsMatch method to check if a file name matches the pattern.
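The steps above can be sketched in a few lines. This version is in Python for illustration only (wildcard_to_regex is a made-up helper). It assumes * should map to .* (zero or more characters), and it does not reproduce the special cases of GetFiles.

```python
import re


def wildcard_to_regex(pattern):
    """Hypothetical helper: turn a * / ? wildcard pattern into a compiled regex."""
    escaped = re.escape(pattern)  # "*.txt" becomes r"\*\.txt"
    body = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile("^" + body + "$", re.IGNORECASE)


if __name__ == "__main__":
    names = ["notes.txt", "notes.txtx", "POS12.dat", "POS123.dat"]
    for wildcard in ["*.txt", "POS??.dat"]:
        regex = wildcard_to_regex(wildcard)
        print(wildcard, [name for name in names if regex.match(name)])
```

Escaping first and replacing afterwards keeps the literal dot in ".txt" from turning into a match-anything dot.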
Just call the Windows API function PathMatchSpecExW().
using System;
using System.Runtime.InteropServices;
[Flags]
public enum MatchPatternFlags : uint
{
Normal = 0x00000000, // PMSF_NORMAL
Multiple = 0x00000001, // PMSF_MULTIPLE
DontStripSpaces = 0x00010000 // PMSF_DONT_STRIP_SPACES
}
class FileName
{
[DllImport("Shlwapi.dll", SetLastError = false)]
static extern int PathMatchSpecExW([MarshalAs(UnmanagedType.LPWStr)] string file,
[MarshalAs(UnmanagedType.LPWStr)] string spec,
MatchPatternFlags flags);
/*******************************************************************************
* Function: MatchPattern
*
* Description: Matches a file name against one or more file name patterns.
*
* Arguments: file - File name to check
* spec - Name pattern(s) to search for
* flags - Flags to modify search condition (MatchPatternFlags)
*
* Return value: Returns true if name matches the pattern.
*******************************************************************************/
public static bool MatchPattern(string file, string spec, MatchPatternFlags flags)
{
if (String.IsNullOrEmpty(file))
return false;
if (String.IsNullOrEmpty(spec))
return true;
int result = PathMatchSpecExW(file, spec, flags);
return (result == 0);
}
}
Some kind of regex/glob is the way to go, but there are some subtleties; your question indicates you want identical semantics to IO.DirectoryInfo.GetFiles. That could be a challenge, because of the special cases involving 8.3 vs. long file names and the like. The whole story is on MSDN.
If you don't need an exact behavioral match, there are a couple of good SO questions:
glob pattern matching in .NET
How to implement glob in C#
For anyone who comes across this question years later: I found on the MSDN social boards that the GetFiles() method accepts the * and ? wildcard characters in the searchPattern parameter (at least in .NET 3.5, 4.0, and 4.5).
Directory.GetFiles(string path, string searchPattern)
http://msdn.microsoft.com/en-us/library/wz42302f.aspx
Please try the code below.
static void Main(string[] args)
{
string _wildCardPattern = "*.txt";
List<string> _fileNames = new List<string>();
_fileNames.Add("text_file.txt");
_fileNames.Add("csv_file.csv");
Console.WriteLine("\nFilenames that match the [{0}] pattern are: ", _wildCardPattern);
// Build the regex once instead of once per file name
CustomWildCardPattern _pattern = new CustomWildCardPattern(_wildCardPattern);
foreach (string _fileName in _fileNames)
{
if (_pattern.IsMatch(_fileName))
{
Console.WriteLine("{0}", _fileName);
}
}
}
public class CustomWildCardPattern : Regex
{
public CustomWildCardPattern(string wildCardPattern)
: base(WildcardPatternToRegex(wildCardPattern))
{
}
public CustomWildCardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(WildcardPatternToRegex(wildcardPattern), regexOptions)
{
}
private static string WildcardPatternToRegex(string wildcardPattern)
{
string patternWithWildcards = "^" + Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = patternWithWildcards.Replace("\\?", ".") + "$";
return patternWithWildcards;
}
}
For searching against a specific pattern, it might be worth using File Globbing which allows you to use search patterns like you would in a .gitignore file.
See here: https://learn.microsoft.com/en-us/dotnet/core/extensions/file-globbing
This allows you to add both inclusions & exclusions to your search.
Please see below the example code snippet from the Microsoft Source above:
Matcher matcher = new Matcher();
matcher.AddIncludePatterns(new[] { "*.txt" });
IEnumerable<string> matchingFiles = matcher.GetResultsInFullPath(filepath);
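The include/exclude idea also works on a list of names that is already in memory. As a rough sketch in Python, for illustration only: select_names is a made-up helper, and fnmatchcase follows shell-style wildcard rules, which are not the same as the .gitignore-style rules of the .NET Matcher.

```python
from fnmatch import fnmatchcase


def select_names(names, includes, excludes=()):
    """Hypothetical helper: keep names matching any include and no exclude pattern."""
    return [
        name for name in names
        if any(fnmatchcase(name, pattern) for pattern in includes)
        and not any(fnmatchcase(name, pattern) for pattern in excludes)
    ]


if __name__ == "__main__":
    files = ["a.txt", "b.csv", "draft.txt"]
    print(select_names(files, ["*.txt"]))
    print(select_names(files, ["*.txt"], excludes=["draft*"]))
```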
Using RegexOptions.IgnoreCase fixes the case sensitivity of the wildcard class above:
public class WildcardPattern : Regex {
public WildcardPattern(string wildCardPattern)
: base(ConvertPatternToRegex(wildCardPattern), RegexOptions.IgnoreCase) {
}
public WildcardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(ConvertPatternToRegex(wildcardPattern), regexOptions) {
}
private static string ConvertPatternToRegex(string wildcardPattern) {
string patternWithWildcards = Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = string.Concat("^", patternWithWildcards.Replace("\\?", "."), "$");
return patternWithWildcards;
}
}

Regular Expression to Extract the Url out of the Anchor Tag

I want to extract the HTTP link from inside anchor tags. Only links to WMV files should be extracted.
Because HTML's syntactic rules are so loose, it's pretty difficult to do with any reliability (unless, say, you know for absolute certain that all your tags will use double quotes around their attribute values). Here's some fairly general regex-based code for the purpose:
function extract_urls($html) {
    $urls = array();
    // Strip HTML comments first. The pattern needs explicit / delimiters,
    // otherwise < and > themselves are taken as the delimiters.
    $html = preg_replace('/<!--.*?-->/s', '', $html);
    // href values in double quotes, in single quotes, and unquoted
    $patterns = array(
        '/<a\s+[^>]*href="([^"]+)"[^>]*>/is',
        '/<a\s+[^>]*href=\'([^\']+)\'[^>]*>/is',
        '/<a\s+[^>]*href=([^"\'][^> ]*)[^>]*>/is',
    );
    foreach ($patterns as $pattern) {
        preg_match_all($pattern, $html, $matches);
        foreach ($matches[1] as $url) {
            $url = str_replace('&amp;', '&', trim($url));
            if (preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
                $urls[] = $url;
        }
    }
    return $urls;
}
Regex:
<a\\s+href\\s*=\\s*(?:(\"|\')(?<link>[^\"]*\\.wmv)(\"|\'))\\s*>(?<name>.*)\\s*</a>
[Note: \s* is used in several places to match the extra white space characters that can occur in the html.]
Sample C# code:
/// <summary>
/// Assigns proper values to link and name, if the htmlId matches the pattern
/// Matches only for .wmv files
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetHrefDetailsWMV(string htmlATag, out string wmvLink, out string name)
{
    wmvLink = null;
    name = null;
    // \\s+ requires whitespace between "a" and "href"; \\. matches a literal dot before "wmv"
    string pattern = "<a\\s+href\\s*=\\s*(?:(\"|\')(?<link>[^\"]*\\.wmv)(\"|\'))\\s*>(?<name>.*)\\s*</a>";
    // Match once, case-insensitively, and reuse the result
    Match match = Regex.Match(htmlATag, pattern, RegexOptions.IgnoreCase);
    if (!match.Success)
        return false;
    wmvLink = match.Groups["link"].Value;
    name = match.Groups["name"].Value;
    return true;
}
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file'>Name of File</a></td>",
out wmvLink, out name); // No match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv'>Name of File</a></td>",
out wmvLink, out name); // Match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv' >Name of File</a></td>", out wmvLink, out name); // Match
I wouldn't do this with regex - I would probably use jQuery:
jQuery('a[href$=".wmv"]').attr('href')
Compare this to chaos's simplified regex example, which (as stated) doesn't deal with fussy/complex markup, and you'll hopefully understand why a DOM parser is better than a regex for this type of problem.
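For comparison, here is what the parser-based route can look like outside the browser. This is a minimal sketch using Python's standard html.parser, for illustration only (WmvLinkCollector and extract_wmv_links are made-up names). It only checks whether the href ends in .wmv, so a URL with a query string after the extension would be missed.

```python
from html.parser import HTMLParser


class WmvLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags that end in .wmv."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".wmv"):
                if value not in self.links:  # skip duplicates
                    self.links.append(value)


def extract_wmv_links(html):
    collector = WmvLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = ('<a href="/a.wmv">A</a> <a href=\'/b.WMV\'>B</a> '
            '<a href=videos/c.wmv>C</a> <a href="/d.mp4">D</a> '
            '<!-- <a href="/e.wmv">E</a> -->')
    print(extract_wmv_links(page))
```

The parser handles double-quoted, single-quoted and unquoted attribute values and skips commented-out markup, so none of those cases needs its own regex.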