How can I parse a filename using a regular expression in Python?

I've got the following filename: aabbcc_id_1112233.png, which translates to the pattern [A-Za-z0-9]_id_[0-9].png, where [x] means one or more symbols from the class x. How can I feed this to Python's re library so that it returns the tuple (id, id_name)?
E.g., for aabbcc22_id_123.png I want to receive (id, id_name) = ('aabbcc22', 'id_123').
The use case: currently I .split() on the underscore, which is hacky since I have to use indexes:
base = filename.split('.')[0]
return (base.split('_')[0], '_'.join(base.split('_')[1:]))

This will do the job:
>>> import re
>>> get_id = re.compile('(.*)_(id_.*)[.]png')
>>> get_id.findall('aabbcc22_id_123.png')
[('aabbcc22', 'id_123')]
>>>
And you can assign the values to id and id_name variables using this:
>>> [(id, id_name)] = get_id.findall('aabbcc22_id_123.png')
>>> id, id_name
('aabbcc22', 'id_123')
>>>
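If you expect exactly one match per filename, a minimal alternative sketch using re.match with named groups (the pattern mirrors the one above; m will be None when the filename doesn't fit):
>>> import re
>>> get_id = re.compile(r'(?P<id>.*)_(?P<id_name>id_.*)\.png')
>>> m = get_id.match('aabbcc22_id_123.png')
>>> (m.group('id'), m.group('id_name'))
('aabbcc22', 'id_123')
>>>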

Related

Python - Check if string is in bigger string

I'm working with Python v2.7, and I'm trying to find out if you can tell if a word is in a string.
If, for example, I have a string and the word I want to find:
str = "ask and asked, ask are different ask. ask"
word = "ask"
How should I code this so that I know the result doesn't include words that are part of other words? In the example above I want every "ask" except the one in "asked".
I have tried the following code, but it doesn't work:
def exact_Match(str1, word):
    match = re.findall(r"\\b" + word + "\\b", str1, re.I)
    if len(match) > 0:
        return True
    return False
Can someone please explain how I can do it?
You can use the following function:
>>> test_str = "ask and asked, ask are different ask. ask"
>>> word = "ask"
>>> def finder(s, w):
...     return re.findall(r'\b{}\b'.format(w), s, re.U)
...
>>> finder(test_str, word)
['ask', 'ask', 'ask', 'ask']
Note that you need \b, the word-boundary anchor, in the regex!
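Incidentally, this is why the exact_Match function in the question fails: inside a raw string, r"\\b" is a literal backslash followed by b, not the word-boundary escape. A quick REPL check with the same test_str:
>>> re.findall(r"\\b" + word + r"\\b", test_str, re.I)  # literal backslashes: no match
[]
>>> re.findall(r"\b" + word + r"\b", test_str, re.I)
['ask', 'ask', 'ask', 'ask']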
Or you can use the following function to return the indices of the word in the split string:
>>> def finder(s, w):
...     return [i for i, j in enumerate(re.findall(r'\b\w+\b', s, re.U)) if j == w]
...
>>> finder(test_str,word)
[0, 3, 6, 7]
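One caveat with both versions: w is interpolated into the pattern verbatim, so a word containing regex metacharacters would break or mis-match. A hedged refinement using re.escape (same output for plain words like "ask"):
>>> def finder(s, w):
...     # re.escape neutralises any regex metacharacters in w
...     return re.findall(r'\b{}\b'.format(re.escape(w)), s, re.U)
...
>>> finder(test_str, word)
['ask', 'ask', 'ask', 'ask']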

Creating a feature dictionary for Python machine learning (naive bayes) algorithm

I would like to predict, for example, Chinese vs. non-Chinese ethnicity using last names. In particular, I want to extract three-letter substrings from the last names. So, for example, the last name "gao" will give one feature, "gao", while "chan" will give two features, "cha" and "han".
The splitting is successfully done in the three_split function below. But as far as I understand, to incorporate this as a feature set I need to return the output as a dictionary. Any hints on how to do that? For "Chan", the dictionary should map "cha" and "han" to True.
from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return [word[start:start+split] for start in range(0, len(word)-2)]

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
Here's a white-box answer:
Running your original code, it outputs:
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'
Looking at line 17:
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)
It seems that PositiveNaiveBayesClassifier requires objects that support .items(), and intuitively that should be a dict if the NLTK code is Pythonic.
Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there isn't any clear explanation of what the positive_featuresets parameter should contain:
:param positive_featuresets: A list of featuresets that are known as
    positive examples (i.e., their label is True).
Checking the docstring, we see this example:
Example:
>>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
>>> sports_sentences = [ 'The team dominated the game',
... 'They lost the ball',
... 'The game was intense',
... 'The goalkeeper catched the ball',
... 'The other team controlled the ball' ]
Mixed topics, including sports:
>>> various_sentences = [ 'The President did not comment',
... 'I lost the keys',
... 'The team won the game',
... 'Sara has two kids',
... 'The ball went off the court',
... 'They had the ball for the whole game',
... 'The show is over' ]
The features of a sentence are simply the words it contains:
>>> def features(sentence):
...     words = sentence.lower().split()
...     return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones as unlabeled examples:
>>> positive_featuresets = list(map(features, sports_sentences))
>>> unlabeled_featuresets = list(map(features, various_sentences))
>>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
... unlabeled_featuresets)
Now we find the features() function that converts the sentences into features; it returns
dict(('contains(%s)' % w, True) for w in words)
Basically this is the object that .items() can be called on. Looking at the generator expression inside dict(), the 'contains(%s)' % w wrapper is a little redundant unless it's there for human readability, so you could just have used dict((w, True) for w in words).
Also, replacing spaces with underscores might be redundant unless there's a use for it later on.
Lastly, the slicing and bounded iteration could have been replaced with the ngrams function, which can extract character ngrams, e.g.:
>>> word = 'alexgao'
>>> split=3
>>> [word[start:start+split] for start in range(0, len(word)-2)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word,3)]
['ale', 'lex', 'exg', 'xga', 'gao']
Your feature extraction function could be simplified to:
from nltk.util import ngrams

def three_split(word):
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3))
[out]:
{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False
In fact, NLTK classifiers are so versatile that you can use tuples of characters as features, so you don't need to join the ngrams back into strings when extracting the features, i.e.:
from nltk.classify import PositiveNaiveBayesClassifier
from nltk.util import ngrams

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    return dict((ng, True) for ng in ngrams(word.lower(), 3))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))
[out]:
{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}
With some trial and error, I think I've got it. Thanks.
from nltk.classify import PositiveNaiveBayesClassifier

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']
nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True)
                for start in range(0, len(word)-2))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
                                                unlabeled_featuresets)

name = "dennis kidd"
print three_split(name)
print classifier.classify(three_split(name))
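If you also want the model's confidence rather than a bare True/False, prob_classify (inherited from NaiveBayesClassifier) returns a probability distribution over the two labels. A minimal sketch on top of the classifier trained above; the test name "chao" is just a made-up example:
dist = classifier.prob_classify(three_split("chao"))
print dist.prob(True)   # estimated probability that the name is Chinese
print dist.prob(False)  # and the complement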

Python 2.7: print accents inside a list [duplicate]

If you have a byte string as below, containing non-ASCII characters, you can print it and get the unescaped version:
>>> s = "äåö"
>>> s
'\xc3\xa4\xc3\xa5\xc3\xb6'
>>> print s
äåö
but if we have a list containing the string above and print it:
>>> s = ['äåö']
>>> s
['\xc3\xa4\xc3\xa5\xc3\xb6']
>>> print s
['\xc3\xa4\xc3\xa5\xc3\xb6']
You still get escaped character sequences. How do you go about getting the contents of the list unescaped; is it even possible? Like this:
>>> print s
['äåö']
Also, if the strings are of the unicode type, how do you go about doing the same as above?
>>> s = u'åäö'
>>> s
u'\xe5\xe4\xf6'
>>> print s
åäö
>>> s = [u'åäö']
>>> s
[u'\xe5\xe4\xf6']
>>> print s
[u'\xe5\xe4\xf6']
When you print a string, you get the output of the object's __str__ method - in this case the string without quotes. The __str__ method of a list is different: it creates a string containing the opening and closing [] plus the string produced by the __repr__ method of each object contained within. What you're seeing is the difference between __str__ and __repr__.
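A quick REPL illustration of that difference, using the byte string from the question (Python 2, UTF-8 terminal assumed):
>>> s = 'äåö'
>>> print s.__str__()   # what `print s` uses
äåö
>>> print s.__repr__()  # what the list uses for its elements
'\xc3\xa4\xc3\xa5\xc3\xb6'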
You can build your own string instead:
print '[' + ','.join("'" + str(x) + "'" for x in s) + ']'
This version should work on both Unicode and plain ASCII byte strings in Python 2 (non-ASCII byte strings would still need an explicit .decode, since unicode() assumes ASCII):
print u'[' + u','.join(u"'" + unicode(x) + u"'" for x in s) + u']'
Is this satisfactory?
>>> s = ['äåö', 'äå']
>>> print "\n".join(s)
äåö
äå
>>> print ", ".join(s)
äåö, äå
>>> s = [u'åäö']
>>> print ",".join(s)
åäö
In Python 2.x the default is what you're experiencing:
>>> s = ['äåö']
>>> s
['\xc3\xa4\xc3\xa5\xc3\xb6']
In Python 3, however, it displays properly:
>>> s = ['äåö']
>>> s
['äåö']
Another solution:
s = [u'äåö', u'äå']
encodedlist = u', '.join(map(unicode, s))
print(u'[{}]'.format(encodedlist).encode('UTF-8'))
gives
[äåö, äå]
One can use this wrapper class:
#!/usr/bin/python
# -*- coding: utf-8 -*-

class ReprToStrString(str):
    def __repr__(self):
        return "'" + self.__str__() + "'"

class ReprToStr(object):
    def __init__(self, printable):
        if isinstance(printable, str):
            self._printable = ReprToStrString(printable)
        elif isinstance(printable, list):
            self._printable = list([ReprToStr(item) for item in printable])
        elif isinstance(printable, dict):
            self._printable = dict(
                [(ReprToStr(key), ReprToStr(value)) for (key, value) in printable.items()])
        else:
            self._printable = printable

    def __repr__(self):
        return self._printable.__repr__()
russian1 = ['Валенки', 'Матрёшка']
print russian1
# Output:
# ['\xd0\x92\xd0\xb0\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xba\xd0\xb8', '\xd0\x9c\xd0\xb0\xd1\x82\xd1\x80\xd1\x91\xd1\x88\xd0\xba\xd0\xb0']
print ReprToStr(russian1)
# Output:
# ['Валенки', 'Матрёшка']
russian2 = {'Валенки': 145, 'Матрёшка': 100500}
print russian2
# Output:
# {'\xd0\x92\xd0\xb0\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xba\xd0\xb8': 145, '\xd0\x9c\xd0\xb0\xd1\x82\xd1\x80\xd1\x91\xd1\x88\xd0\xba\xd0\xb0': 100500}
print ReprToStr(russian2)
# Output:
# {'Матрёшка': 100500, 'Валенки': 145}

Why does Python add extra quotes in my array returned by split()?

Here is the relevant data being parsed:
alternateClassName: 'FutureSurvey',
alternateClassName: ['HardwareSurvey'],
alternateClassName: ['OptimismSurvey', 'OptimismSurveyTwo']
Here is my regex:
alternate_regex = re.compile('.*?alternateClassName\s*:\s*(\[\s*(.*?)\s*\]|[\'\"]\s*(.*?)\s*[\'\"]).*', re.M)
And here is my code:
alternate_match = alternate_regex.match(line)
if alternate_match and alternate_match.group and alternate_match.group(1):
    alternateList = alternate_match.group(1).strip().split(',')
    print alternateList
    dependent_mapping[classpathTxt]['alternateList'] = alternateList
Here is what gets printed:
["'FutureSurvey'"]
["['HardwareSurvey']"]
["['OptimismSurvey',", "'OptimismSurveyTwo']"]
I would have expected this:
['FutureSurvey']
['HardwareSurvey']
['OptimismSurvey', 'OptimismSurveyTwo']
Anyone know what's going on?
Your .strip() isn't removing the quotes because, called without an argument, it only strips whitespace. Replace it with .strip("'"):
>>> x = "'hello'"
>>> x.strip()
"'hello'"
>>> x.strip("'")
'hello'
>>>
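Note that split() gives you a list, so the strip has to be applied to each element, and for the bracketed inputs above you would want to strip the brackets and whitespace as well. A minimal sketch (the character set passed to strip is an assumption based on the sample data):
>>> raw = "['OptimismSurvey', 'OptimismSurveyTwo']"
>>> [x.strip(" []'\"") for x in raw.split(',')]
['OptimismSurvey', 'OptimismSurveyTwo']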

Python letter swapping

I'm making a program that scrambles words for fun and I've hit a roadblock. I am attempting to switch all the letters in a string and I'm not quite sure how to go about it (hello = ifmmp). I've looked all around and haven't been able to find any answers to this specific question. Any help would be great!
You want a simple randomized cipher? The following will work for all-lowercase inputs, and can easily be extended.
import random
import string

# Shuffle the lowercase alphabet to build a random substitution table
swapped = list(string.lowercase)
random.shuffle(swapped)
cipher = string.maketrans(string.lowercase, ''.join(swapped))

def change(val):
    return string.translate(val, cipher)
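For example (the mapping is random, so the output differs from run to run):
>>> change('hello')
'qkrrs'  # example output only; yours will vary with the shuffle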
You can probably modify this example to achieve what you need. Here every vowel in a string is replaced by its vowel position:
from string import maketrans  # Required to call maketrans function.

intab = "aeiou"
outtab = "12345"
trantab = maketrans(intab, outtab)

s = "this is string example....wow!!!"
print s.translate(trantab)
# Output:
# th3s 3s str3ng 2x1mpl2....w4w!!!
Try maketrans in combination with the string.translate function. This code first removes the letters of your word from the pool of letters being scrambled in, so no letter can map to itself. If you want lowercase only, use string.lowercase instead of string.letters.
>>> import string, random
>>> letters = list(string.letters)
>>> random.shuffle(letters)
>>> letters = "".join(letters)
>>> word = 'hello'
>>> for letter in word:
...     letters = letters.replace(letter, '')
...
>>> transtab = string.maketrans(word, letters[:len(word)])
>>> print word.translate(transtab)
XQEEN
The "scrambling" you appear to be after is called Caesar's cipher, with a right shift of 1. The following Python will achieve what you're after:
def caesar(str):
    from string import maketrans
    fromalpha = "abcdefghijklmnopqrstuvwxyz"
    # Move the first character to the end, so each letter maps one step right
    toalpha = fromalpha[1:] + fromalpha[:1]
    # Make it work with capital letters
    fromalpha += fromalpha.upper()
    toalpha += toalpha.upper()
    x = maketrans(fromalpha, toalpha)
    return str.translate(x)
If you're interested in the general case, this function will do the job. (Note that it is conventional to express Caesar ciphers in terms of left shifts, rather than right.)
def caesar(str, lshift):
    from string import maketrans
    fromalpha = "abcdefghijklmnopqrstuvwxyz"
    toalpha = fromalpha[-lshift:] + fromalpha[:-lshift]
    fromalpha += fromalpha.upper()
    toalpha += toalpha.upper()
    x = maketrans(fromalpha, toalpha)
    return str.translate(x)
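To reproduce the hello -> ifmmp example from the question, note that a right shift of 1 is a left shift of -1 here:
>>> caesar("hello", -1)
'ifmmp'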