I've got the following filename: aabbcc_id_1112233.png, which corresponds to the pattern [A-Za-z0-9]_id_[0-9].png, where [x] means it may contain >= 1 symbols from x. How can I express this with Python's re library so that it returns the tuple (id, id_name)?
E.g., for aabbcc22_id_123.png I want to receive (id, id_name) = ('aabbcc22', 'id_123').
The use case: currently I split on underscores, which is hacky since I have to use indexes:
base = filename.split('.')[0]
return (base.split('_')[0], '_'.join(base.split('_')[1:]))
This will do the job:
>>> import re
>>> get_id = re.compile('(.*)_(id_.*)[.]png')
>>> get_id.findall('aabbcc22_id_123.png')
[('aabbcc22', 'id_123')]
>>>
And you can assign the values to id and id_name variables using this:
>>> [(id, id_name)] = get_id.findall('aabbcc22_id_123.png')
>>> id, id_name
('aabbcc22', 'id_123')
>>>
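A close variant, in case you want a single match rather than a list from findall, is to use named groups with re.match; this is just an optional alternative, the pattern is the same apart from the group names:
>>> m = re.match(r'(?P<id>.*)_(?P<id_name>id_.*)[.]png', 'aabbcc22_id_123.png')
>>> (m.group('id'), m.group('id_name'))
('aabbcc22', 'id_123')
>>>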
I'm having trouble getting a regex to remove words that contain both digits and letters. I keep getting "TypeError: expected string or buffer". Any help you can provide will be greatly appreciated.
$ testing abc
sorted_word = re.sub("\S+\d\S+", "", word_sort).strip()
File "/usr/lib64/python2.6/re.py", line 151, in sub
return _compile(pattern, 0).sub(repl, string, count)
TypeError: expected string or buffer
#! /usr/bin/env python
import os
import sys
import re

in_list = sys.argv

def word_sort(in_list):
    word_sort = " 1a "
    word_sort = sorted(in_list[1:], key=len)
    for i in word_sort:
        punctuation = '.',',',';','!',' / ','"','?' #strips punctuation from words
        if i in punctuation: #removes punctuation
            word_sort = word_sort.replace(i," ")
    word_sort= sorted(word_sort, key=lambda L: (L.lower(), L))
    sorted_word = " 1a "
    sorted_word = re.sub("\S+\d\S+", "", word_sort).strip()
    return sorted_word

print (word_sort(in_list))
This line:
word_sort= sorted(word_sort, key=lambda L: (L.lower(), L))
iterates over the word_sort string but doesn't create a str object, just a list of the sorted characters, so the re module chokes on it.
You have to join the characters again to recompose the string:
word_sort= "".join(sorted(word_sort, key=lambda L: (L.lower(), L)))
small tester:
>>> sorted("dcba")
['a', 'b', 'c', 'd']
>>> "".join(sorted("dcba"))
'abcd'
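For the same reason, re.sub insists on a string (or buffer) for its third argument, so the error disappears once the characters are joined back into a str. A quick interpreter check with a made-up sample string:
>>> import re
>>> re.sub("\S+\d\S+", "", list("abc a1b def"))   # a list -> the same TypeError
Traceback (most recent call last):
  ...
TypeError: expected string or buffer
>>> re.sub("\S+\d\S+", "", "abc a1b def").strip()
'abc  def'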
BTW: you should avoid giving the function and the local variable the same name, word_sort. It's difficult to read. And fortunately you don't call your function recursively :)
I use Python 2.7 (I cannot use 3.4), and I have the following code:
text = """
saú$_ß$¤×÷asd县阴őasdCharacters: \"县阴 asdsadsasd县阴
"""
text = unicode(text, "utf-8")
print("Method 1\n")
reg = "Characters: \"[\u4e00-\u9fff]+.*?"
reg = unicode(reg, "utf-8")
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text): # Number of occurrences in the 'k' line.
print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
print("Method 2\n")
reg = u"Characters: \"[\u4e00-\u9fff]+.*?"
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text): # Number of occurrences in the 'k' line.
print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
The output is:
Method 1
Method 2
Results: Characters: "??
The question is: how can I get the Method 2 result while building the pattern from variables? I haven't found any solution yet, and I don't understand why Method 1 doesn't work.
Thanks for any suggestions.
Method 1 does not work because a \u#### escape means nothing inside an encoded (byte) string; the escape is never interpreted. Instead, you need the correct byte sequence for that range. If you do this, then Method 1 produces the same results as Method 2. I modified your code as follows:
# -*- coding: utf-8 -*-
import sys
import re

text = """
saú$_ß$¤×÷asd县阴őasdCharacters: \"县阴 asdsadsasd县阴
"""
text = unicode(text, "utf-8")

print("\nMethod 1\n")
reg = "Characters: \"[\xe4\xb8\x80-\xe9\xbf\xbf]+.*?"
reg = unicode(reg, "utf-8")
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):  # Number of occurrences in the 'k' line.
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))

print("\nMethod 2\n")
reg = u"Characters: \"[\u4e00-\u9fff]+.*?"
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):  # Number of occurrences in the 'k' line.
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
It produces the following results on my machine:
Method 1
Results: Characters: "县阴
Method 2
Results: Characters: "县阴
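If the pattern has to be built from a plain str variable that contains literal \u#### sequences (for example, read from a file), a further option in Python 2 is to decode it with the unicode_escape codec instead of utf-8, so those sequences become real code points. This is only a sketch of that idea, reusing text and the imports from the script above, and it assumes the escaped pattern itself is pure ASCII:
# Assumed scenario: reg starts out as a plain byte string in which \u4e00 and
# \u9fff are literal backslash sequences, not code points.
reg = "Characters: \"[\u4e00-\u9fff]+.*?"
reg = reg.decode("unicode_escape")   # resolves \uXXXX into actual code points
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))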
I would like to predict, for example, Chinese vs. non-Chinese ethnicity from last names. In particular, I want to extract three-letter substrings from the last names. So, for example, the last name "gao" gives one feature, "gao", while "chan" gives two features, "cha" and "han".
The splitting is done successfully in the three_split function below. But as far as I understand, to incorporate this as a feature set I need to return the output as a dictionary. Any hints on how to do that? For "chan", the dictionary should map "cha" and "han" to True.
from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return [word[start:start+split] for start in range(0, len(word)-2)]

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
Here's a white-box answer:
Running your original code, it outputs:
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'
Looking at line 17:
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)
It seems that PositiveNaiveBayesClassifier requires objects that have an .items() method, and intuitively that should be a dict if the NLTK code is Pythonic.
Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there isn't any clear explanation of what the positive_featuresets parameter should contain:
:param positive_featuresets: A list of featuresets that are known as
positive examples (i.e., their label is True).
Checking the docstring, we see this example:
Example:
>>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
>>> sports_sentences = [ 'The team dominated the game',
... 'They lost the ball',
... 'The game was intense',
... 'The goalkeeper catched the ball',
... 'The other team controlled the ball' ]
Mixed topics, including sports:
>>> various_sentences = [ 'The President did not comment',
... 'I lost the keys',
... 'The team won the game',
... 'Sara has two kids',
... 'The ball went off the court',
... 'They had the ball for the whole game',
... 'The show is over' ]
The features of a sentence are simply the words it contains:
>>> def features(sentence):
... words = sentence.lower().split()
... return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones as unlabeled examples:
>>> positive_featuresets = list(map(features, sports_sentences))
>>> unlabeled_featuresets = list(map(features, various_sentences))
>>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
... unlabeled_featuresets)
Now we find the features() function that converts the sentences into features and returns
dict(('contains(%s)' % w, True) for w in words)
Basically this is the thing that .items() can be called on. Looking at the dict comprehension, the 'contains(%s)' % w part seems a little redundant unless it's there for human readability, so you could have just used dict((w, True) for w in words).
Also, replacing spaces with underscores might be redundant unless there's a use for it later on.
Lastly, the slicing and range-limited iteration could have been replaced with the ngrams function from nltk.util, which can extract character ngrams, e.g.:
>>> word = 'alexgao'
>>> split=3
>>> [word[start:start+split] for start in range(0, len(word)-2)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word,3)]
['ale', 'lex', 'exg', 'xga', 'gao']
Your feature extraction function could then be simplified like this:
from nltk.util import ngrams

def three_split(word):
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3))
[out]:
{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False
In fact, NLTK classifiers are so versatile that you can use tuples of characters directly as features, so you don't need to join the ngrams back into strings when extracting the features, i.e.:
from nltk.classify import PositiveNaiveBayesClassifier
from nltk.util import ngrams

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    return dict((ng, True) for ng in ngrams(word.lower(), 3))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
[out]:
{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}
With some trial and error, I think I've got it. Thanks.
from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True)
                for start in range(0, len(word)-2))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

name = "dennis kidd"
print three_split(name)
print classifier.classify(three_split(name))
Here is the relevant data being parsed:
alternateClassName: 'FutureSurvey',
alternateClassName: ['HardwareSurvey'],
alternateClassName: ['OptimismSurvey', 'OptimismSurveyTwo']
Here is my regex:
alternate_regex = re.compile('.*?alternateClassName\s*:\s*(\[\s*(.*?)\s*\]|[\'\"]\s*(.*?)\s*[\'\"]).*', re.M)
And here is my code:
alternate_match = alternate_regex.match(line)
if alternate_match and alternate_match.group and alternate_match.group(1):
    alternateList = alternate_match.group(1).strip().split(',')
    print alternateList
    dependent_mapping[classpathTxt]['alternateList'] = alternateList
Here is what gets printed:
["'FutureSurvey'"]
["['HardwareSurvey']"]
["['OptimismSurvey',", "'OptimismSurveyTwo']"]
I would have expected this:
['FutureSurvey']
['HardwareSurvey']
['OptimismSurvey', 'OptimismSurveyTwo']
Anyone know what's going on?
Your .strip() isn't doing anything here: without an argument it only strips whitespace, and there is none at the ends of these substrings. Replace it with .strip("'") to remove the quote characters:
>>> x = "'hello'"
>>> x.strip()
"'hello'"
>>> x.strip("'")
'hello'
>>>
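Applied to the original code, one possible cleanup (an illustrative sketch, not the only way) is to drop the surrounding brackets from the capture first, then strip quotes and whitespace from each element after splitting:
raw = alternate_match.group(1)
# remove enclosing [ ] if present, split on commas, then clean each class name
alternateList = [name.strip().strip("'\"") for name in raw.strip('[]').split(',')]
print alternateList   # e.g. ['OptimismSurvey', 'OptimismSurveyTwo']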