Weird regular pattern behaviour in Python - regex

I have the following program in Python.
import re
data = '''component FA_8 is
port( a : in bit_vector(7 downto 0);
b: in bit_vector(7 downto 0);
s: out bit_vector(7 downto 0);
c: out bit);
end component;'''
m = re.search(r'''component\ +(\w+)\ +is[\ \n]+
port\ *[(]\ +''', data, re.I | re.VERBOSE)
if m:
    print(m.group())
else:
    print("Can't find pattern")
I can't figure out why it is not working. If I change the end of the pattern to port\ *[(]\ * then it matches.

If the quantifier is the only difference, then there is no space at that position in the text. Could it be a tab in the original string?
I would replace the escaped space with \s, which matches any whitespace character: a space, a tab, \r, \n and other whitespace characters.
m = re.search(r'''component\s+(\w+)\s+is\s+
port\s*[(]\s+''', data, re.I | re.VERBOSE)
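The difference can be sketched as follows, assuming a hypothetical input in which the gap after "port(" is a tab rather than a space (the `data` string and the names `spaces_only` and `any_whitespace` are made up for illustration):

```python
import re

# Hypothetical input: the gap after "port(" is a tab, not a space.
data = "component FA_8 is\nport(\ta : in bit_vector(7 downto 0);"

# Escaped spaces only: "\ +" needs at least one literal space after "(".
spaces_only = re.compile(r'''component\ +(\w+)\ +is[\ \n]+
                             port\ *[(]\ +''', re.I | re.VERBOSE)

# \s accepts spaces, tabs and line breaks alike.
any_whitespace = re.compile(r'''component\s+(\w+)\s+is\s+
                                port\s*[(]\s+''', re.I | re.VERBOSE)

print(spaces_only.search(data))               # None
print(any_whitespace.search(data).group(1))   # FA_8
```

If the real input contains only spaces, both patterns should behave the same, so a differing result is a quick way to confirm that a tab is involved.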


Replace all characters between two characters in python

I'm trying to replace all characters between two characters.
This is going to be my input string:
P<HRVSPECIMEN<<SPECIMENC<<<<<<<K<K<K<K<KKKKKK\n10070070071HRVB212258F1407019<<<<<c<c<<<<<<06
And I am trying to get this output:
P<HRVSPECIMEN<<SPECIMENC<<<<<<<<<<<<<<<<<<<<<\n10070070071HRVB212258F1407019<<<<<<<<<<<<<<06
This regex should give you the results you want. It looks for a K or c that is preceded by a K, c or < and followed by a K, c or < or the end of line:
(?<=[Kc<])[Kc](?=[Kc<]|$)
You can use this with the re.MULTILINE flag to re.sub:
import re
s = '''P<HRVSPECIMEN<<SPECIMENC<<<<<<<K<K<K<K<KKKKKK
10070070071HRVB212258F1407019<<<<<c<c<<<<<<06'''
s = re.sub(r'(?<=[Kc<])[Kc](?=[Kc<]|$)', '<', s, flags=re.MULTILINE)
print(s)
Output:
P<HRVSPECIMEN<<SPECIMENC<<<<<<<<<<<<<<<<<<<<<
10070070071HRVB212258F1407019<<<<<<<<<<<<<<06
If the \n in your string is a literal \n rather than a newline, just replace $ in the regex with \\n:
s = r'P<HRVSPECIMEN<<SPECIMENC<<<<<<<K<K<K<K<KKKKKK\n10070070071HRVB212258F1407019<<<<<c<c<<<<<<06'
s = re.sub(r'(?<=[Kc<])[Kc](?=[Kc<]|\\n)', '<', s)
print(s)
Output:
P<HRVSPECIMEN<<SPECIMENC<<<<<<<<<<<<<<<<<<<<<\n10070070071HRVB212258F1407019<<<<<<<<<<<<<<06

Python Regex - extracting the sentence that contains asterisk

test_str = '**Amount** : $25k **Name** : James **Excess** : None Returned \n **In Suit?** Y **Venue** : SF **Insurance** : N/A \n **FTSA** : None listed'
import re
regex = r"(?:^|[^.?*,!-]*(?<=[.?\s*,!-]))(n/a)(?=[\s.?*!,-])[^.?*,!-]*[.?*,!-]"
subst = ""
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE | re.MULTILINE)
I tried to extract '**Insurance** : N/A' from the string, but the code above doesn't work. How can I make it work?
Thanks in advance!
I would treat the content like a (semi-structured) key-value file format.
You can match the key-value pairs with a regex like this:
(\*\*[a-zA-Z ?]+\*\*) : ((?:(?!\*\*).)*)(?= |$)
Explanation:
(\*\*[a-zA-Z ?]+\*\*) the key; you may have to adjust the character range
 :  the key-value separator, surrounded by spaces
((?:(?!\*\*).)*) the value, captured with a tempered greedy token: everything except a literal **
(?= |$) a lookahead that ends the value at a separating space or at the end of the line or string ($)
Sample Code:
import re
regex = r"(\*\*[a-zA-Z ?]+\*\*) : ((?:(?!\*\*).)*)(?= |$)"
test_str = "**Amount** : $25k **Name** : James **Excess** : None Returned \\n **In Suit?** : Y **Venue** : SF **Insurance** : N/A \\n **FTSA** : None listed"
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    if match.group(1) == "**Insurance**":
        print(match.group(2))
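Treating the text as a key-value format also makes it easy to collect every pair at once. The following is only a sketch built on the pattern above: the helper name `parse_fields` is hypothetical, and the sample string assumes real line breaks and a " : " separator after every key, as in the sample code.

```python
import re

# Pattern from the answer above; group 1 is the key, group 2 the value.
PAIR = re.compile(r"(\*\*[a-zA-Z ?]+\*\*) : ((?:(?!\*\*).)*)(?= |$)", re.MULTILINE)

def parse_fields(text):
    """Return a dict mapping each key (without the asterisks) to its value."""
    fields = {}
    for match in PAIR.finditer(text):
        key = match.group(1).strip("*")
        # Values can keep a trailing space before a line break, so strip them.
        fields[key] = match.group(2).strip()
    return fields

test_str = ("**Amount** : $25k **Name** : James **Excess** : None Returned \n"
            " **In Suit?** : Y **Venue** : SF **Insurance** : N/A \n"
            " **FTSA** : None listed")
fields = parse_fields(test_str)
print(fields["Insurance"])  # N/A
```

A key that lacks the " : " separator (as "**In Suit?** Y" does in the question) would simply be skipped by this pattern.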

Replacement of strings within 2 strings in regex

I have a string:
dkj a * & &*(&(*(
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and I need two groups. Each group must start with //#HELLO followed by a line break; any text can follow (.*), and the group ends with //#BYE preceded by a line break:
1)
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
2)
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and the original string should be replaced with this (basically adding // to each line of each group):
dkj a * & &*(&(*(
////#HELLO
//^%#&UJNWDUK()C*(v 8*J DK*9
////#HE#$^&&(akls#$98akdjl ak##sjdkja
////
//%^&*(//#HELLO//#BYE<><>
////#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
////#HELLO
//&*^J$XUK 8j8 j jk kk8(&*(
////#BYE
Here is my current progress:
I have
\/\/#HELLO\n.*?\/\/#BYE[\n$]
However, I'm not sure how to go about the replacement. I'm thinking of separating each line per group using \G after the //#HELLO and ending with //#BYE.
It's a bit complex, but this will do it:
Search: (?m)(//#HELLO[\r\n]+|\G(?://#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))
Replace: //$1
In Groovy:
String resultString = subjectString.replaceAll(/(?m)(\/\/#HELLO[\r\n]+|\G(?:\/\/#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))/, '//$1');
For grouping into separate lines use the following regex:
//#HELLO\r(.*[\n\r]+)*//#BYE\r?
\r - carriage return (line-break character)
[\n\r] - either line-break character (line feed or carriage return)
*? - non-greedy match
? - match 0 or 1 times
You can take out the ? at the end if it always ends with a newline.
You can then use the group (The value inside the brackets) to search and replace.
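For comparison, here is a rough Python sketch of the same idea using a replacement callback instead of \G. It is not a port of the regexes above; it assumes \n line endings and that //#HELLO and //#BYE each sit alone on their line, and the names `BLOCK`, `comment_out` and `comment_blocks` are made up for illustration.

```python
import re

# A block runs from a line that is exactly //#HELLO to the next line that is
# exactly //#BYE. DOTALL lets ".*?" cross line breaks; MULTILINE anchors ^ and $
# at line boundaries.
BLOCK = re.compile(r'^//#HELLO\n.*?^//#BYE$', re.MULTILINE | re.DOTALL)

def comment_out(match):
    """Prefix every line of the matched block with //."""
    return '\n'.join('//' + line for line in match.group(0).split('\n'))

def comment_blocks(text):
    return BLOCK.sub(comment_out, text)

sample = "\n".join([
    "dkj a * & &*(&(*(",
    "//#HELLO",
    "%^&*(//#HELLO//#BYE<><>",
    "//#BYE",
    "^%#&UJNWDUK()C*(v 8*J DK*90K )",
    "//#HELLO",
    "&*^J$XUK 8j8 j jk kk8(&*(",
    "//#BYE",
])
print(comment_blocks(sample))
```

Because the end marker is anchored to a whole line, the //#BYE embedded in the middle of a line does not end the block early.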

regular expression which should allow limited special characters

Can anyone tell me the regular expression for a text field that should not allow the following characters but can accept other special characters, letters, numbers and so on:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ # &
This will not allow a string that contains any of the characters mentioned above in any part of the string.
^(?!.*[+\-&|!(){}[\]^"~*?:#&]+).*$
Brief Explanation
^ : assert position at the beginning of a line (at the beginning of the string or after a line break character)
(?!.*[+\-&|!(){}[\]^"~*?:#&]+) : assert that it is impossible to match the regex below starting at this position (negative lookahead)
    .* : match any single character that is not a line break character, between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    [+\-&|!(){}[\]^"~*?:#&]+ : match a single character present in the list below, between one and unlimited times, as many times as possible, giving back as needed (greedy)
        + : the character "+"
        \- : a "-" character
        &|!(){}[ : one of the characters &|!(){}[
        \] : a "]" character
        ^"~*?:#& : one of the characters ^"~*?:#&
.* : match any single character that is not a line break character, between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ : assert position at the end of a line (at the end of the string or before a line break character)
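The negative-lookahead approach could be tried out in Python roughly like this. It is a sketch, not the answer's exact regex: the character class is built with re.escape from a plain string, and the backslash is added as an assumption because the question lists it while the regex above does not. `FORBIDDEN`, `BLACKLIST` and `is_allowed` are illustrative names.

```python
import re

# Characters to reject, taken from the question. The backslash is included
# here as an assumption: the question lists it, the regex above omits it.
FORBIDDEN = '+-&|!(){}[]^"~*?:\\#'

# re.escape keeps "-", "]", "^" and "\" literal inside the character class.
BLACKLIST = re.compile(r'^(?!.*[' + re.escape(FORBIDDEN) + r']).*$')

def is_allowed(text):
    """True when the single-line text contains none of the forbidden characters."""
    return BLACKLIST.match(text) is not None

print(is_allowed("plain text_123"))  # True
print(is_allowed("a+b"))             # False
```

The sketch targets single-line input such as a text field; multi-line input would need extra handling.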
It's usually better to whitelist the characters you allow rather than to blacklist the characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html
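A whitelist might look like the sketch below. The allowed set is purely hypothetical and would need adjusting to the field. Instead of explicit Unicode ranges, it relies on \w being Unicode-aware for str patterns in Python 3, which covers accented letters.

```python
import re

# Hypothetical allow-list: letters, digits and underscore (\w is Unicode-aware
# for str patterns in Python 3), plus space, dot, comma, apostrophe and hyphen.
WHITELIST = re.compile(r"[\w .,'-]*")

def is_whitelisted(text):
    """True when every character of text is in the allowed set."""
    return WHITELIST.fullmatch(text) is not None

print(is_whitelisted("Jos\u00e9 N\u00fa\u00f1ez"))  # True
print(is_whitelisted("a+b"))                        # False
```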
I recognize those as the characters which need to be escaped for Solr. If this is the case, and if you are coding in PHP, then you should use my PHP utility functions from Github. Here is one of the Solr functions from there:
/**
 * Escape values destined for Solr
 *
 * @author Dotan Cohen
 * @version 2013-05-30
 *
 * @param value to be escaped. Valid data types: string, array, int, float, bool
 * @return Escaped string, NULL on invalid input
 */
function solr_escape($str)
{
    if ( is_array($str) ) {
        foreach ( $str as &$s ) {
            $s = solr_escape($s);
        }
        return $str;
    }
    if ( is_int($str) || is_float($str) || is_bool($str) ) {
        return $str;
    }
    if ( !is_string($str) ) {
        return NULL;
    }
    $str = addcslashes($str, "+-!(){}[]^\"~*?:\\");
    $str = str_replace("&&", "\\&&", $str);
    $str = str_replace("||", "\\||", $str);
    return $str;
}

Match whole word (Visual Studio style)

I am trying to add Match Whole Word search to my small application.
I want it to do the same thing that Visual Studio is doing.
So for example, below code should work fine:
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();
        String input = "[ abc() *abc ]";
        Match(input, "abc", 2);
        Match(input, "abc()", 1);
        Match(input, "*abc", 1);
        Match(input, "*abc ", 1);
    }
    private void Match(String input, String pattern, int expected)
    {
        String escapedPattern = Regex.Escape(pattern);
        MatchCollection mc = Regex.Matches(input, @"\b" + escapedPattern + @"\b", RegexOptions.IgnoreCase);
        if (mc.Count != expected)
        {
            throw new Exception("match whole word isn't working");
        }
    }
}
Searching for "abc" works fine but other patterns return 0 results.
I think \b is inadequate, but I am not sure what to use.
Any help would be appreciated.
Thanks
The \b metacharacter matches on a word-boundary between an alphanumeric and non-alphanumeric character. The strings that end with non-alphanumeric characters end up failing to match since \b is working as expected.
To perform a proper whole word match that supports both types of data you need to:
use \b before or after any alphanumeric character
use \B (capital B) before or after any non-alphanumeric character
not use \B if the first or last character of the pattern is intentionally a non-alphanumeric character, such as your final example with a trailing space
Based on these points you need to have additional logic to check the incoming search term to shape it into the appropriate pattern. \B works in the opposite manner of \b. If you don't use \B then you could incorrectly end up with partial matches. For example, the word foo*abc would incorrectly be matched with a pattern of @"\*abc\b".
To demonstrate:
string input = "[ abc() *abc foo*abc ]";
string[] patterns =
{
    @"\babc\b", // 3
    @"\babc\(\)\B", // 1
    @"\B\*abc\b", // 1, \B prefix ensures whole word match, "foo*abc" not matched
    @"\*abc\b", // 2, no \B prefix so it matches "foo*abc"
    @"\B\*abc " // 1
};
foreach (var pattern in patterns)
{
    Console.WriteLine("Pattern: " + pattern);
    var matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: " + matches.Count);
    foreach (Match match in matches)
    {
        Console.WriteLine(" " + match.Value);
    }
    Console.WriteLine();
}
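The "additional logic" that shapes a search term into a pattern is not shown above. One possible reading of the three rules, sketched in Python rather than C#, is below. The helper `whole_word_pattern` is hypothetical, it assumes a non-empty term, and it treats leading or trailing whitespace as the "intentional" case that gets no assertion.

```python
import re

def whole_word_pattern(term):
    """Wrap an escaped search term in \\b or \\B depending on its edge characters."""
    def guard(ch):
        if re.match(r"\w", ch):
            return r"\b"   # word character: require a word boundary
        if ch.isspace():
            return ""      # deliberate whitespace: no assertion
        return r"\B"       # other non-word character: require a non-boundary
    return guard(term[0]) + re.escape(term) + guard(term[-1])

text = "[ abc() *abc foo*abc ]"
for term in ["abc", "abc()", "*abc", "*abc "]:
    pattern = whole_word_pattern(term)
    print(repr(term), len(re.findall(pattern, text, flags=re.IGNORECASE)))
```

With this input the counts come out as 3, 1, 1 and 1, in line with the comments in the demonstration above.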
I think this is what you're looking for:
@"(?<!\w)" + escapedPattern + @"(?!\w)"
\b is defined in terms of the presence or absence of "word" characters both before and after the current position. You only care about what's before the first character and what's after the last one.
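The lookaround idea carries over to other regex flavours. As a quick check, here is a Python sketch run against the input from the question (`count_whole_word` is an illustrative helper name, not part of any library):

```python
import re

def count_whole_word(text, term):
    """Count occurrences of term that are not touching a word character on either side."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

text = "[ abc() *abc ]"
for term, expected in [("abc", 2), ("abc()", 1), ("*abc", 1), ("*abc ", 1)]:
    print(repr(term), count_whole_word(text, term), expected)
```

Unlike \b, the lookarounds only look outward from the term, so the same two assertions work whether the term starts or ends with a word character or not.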
The \b is a zero-width assertion that matches between a word character and a non-word character.
Letters, digits and underscores are word characters. *, SPACE, and parens are non-word characters. Therefore, when you use \b\*abc\b as your pattern, it does not match your input: the * is preceded by a space, and there is no word boundary between two non-word characters. Likewise for your pattern involving parens.
To solve this, you will need to eliminate the \b in cases where your (unescaped) input pattern begins or ends with non-word characters.
public void Run()
{
    String input = "[ abc() *abc ]";
    Match(input, @"\babc\b", 2);
    Match(input, @"\babc\(\)", 1);
    Match(input, @"\*abc\b", 1);
    Match(input, @"\*abc\b ", 1);
}
private void Match(String input, String pattern, int expected)
{
    MatchCollection mc = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine((mc.Count == expected)? "PASS ({0}=={1})" : "FAIL ({0}!={1})",
        mc.Count, expected);
}