I have a Python program that counts word frequencies in a text file. The problem is that it takes the full stop into account, which alters the counts. For counting words I've used a sorted list of words. I tried to remove the full stop using
words = open(f, 'r').read().lower().split()
uniqueword = sorted(set(words))
uniqueword = uniqueword.replace(".","")
but I get this error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated :)
You can process the words before you make the set, using a list comprehension:
words = [word.replace(".", "") for word in words]
You could also remove them after (uniquewords = [word.replace...]), but then you will reintroduce duplicates.
Note that if you want to count these words, a Counter may be more useful:
from collections import Counter
counts = Counter(words)
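Putting the pieces together, a minimal end-to-end sketch (assuming f holds the path to your text file, as in your code) might look like this:
from collections import Counter

with open(f, 'r') as fh:
    words = fh.read().lower().split()

words = [word.replace(".", "") for word in words]  # strip the full stops
counts = Counter(words)

print(counts.most_common(5))  # the five most frequent words and their counts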
You might be better off with
import re
words = re.findall(r'\w+', open(f, 'r').read().lower())
which will grab all the strings composed of one or more “word characters” and will ignore punctuation and whitespace.
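For example, on a small made-up sentence:
import re

text = "The cat sat. The cat ran."
print(re.findall(r'\w+', text.lower()))
# ['the', 'cat', 'sat', 'the', 'cat', 'ran']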
My intention is to remove unnecessary newline characters from strings inside of a DataFrame.
Example:
import pandas as pd
data = ['I like this product\n\nThe product is good']
dataf = pd.DataFrame(data)
Original data:
I like this product
The product is good
I tried the following, which was not successful, since all of the newline characters were removed, whereas I wanted to keep one of them.
dataf['new'] = dataf.replace('\\n','', regex=True)
The result was this, all newline characters were removed:
I like this productThe product is good
The result I am trying to achieve is this:
I like this product
The product is good
This should work:
dataf['new'] = dataf.replace(r'(\n)+', r'\n', regex=True)
The + indicates one or more occurrences of the preceding pattern, and however many there are, they will all be replaced by just one newline character.
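For completeness, a minimal runnable sketch of the whole thing (applying the replacement to the original column, here the default column 0):
import pandas as pd

data = ['I like this product\n\nThe product is good']
dataf = pd.DataFrame(data)

# collapse every run of newlines down to a single newline
dataf['new'] = dataf[0].replace(r'(\n)+', r'\n', regex=True)

print(dataf['new'][0])
# I like this product
# The product is good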
I'm trying to do a basic grab from a text column in SQLite and process it into a field in a model that is a List<String>. If the data is empty then I want to set it to an empty list []. However, when I do this, for some reason I get a list that looks empty but actually has a length of 1. To recreate this, I simplified the issue to the following.
String stringList = '';
List<String> aList = [];
aList = stringList.split(',');
print(aList.length);
Why does this print 1? Shouldn't it return 0 since there are no values with a comma in it?
This should print 1.
When you split a string on commas, you find all the positions of commas in the string and return a list of the strings around them. That includes the string before the first comma and the string after the last comma.
In the case where the input contains no commas, you still find the initial string.
If your example had been:
String input = "451";
List<String> parts = input.split(",");
print(parts.length);
you would probably expect parts to be the list ["451"]. That is also what happens here because the split function doesn't distinguish empty parts from non-empty.
If a string does contain a comma, say the string ",", you get two parts when splitting, in this case two empty parts. In general, you get n+1 parts for a string containing n matches of the split pattern.
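For comparison, Python's str.split behaves the same way when given an explicit separator:
print(''.split(','))     # ['']      -> one (empty) part, length 1
print('451'.split(','))  # ['451']   -> length 1
print(','.split(','))    # ['', '']  -> two empty parts, n + 1 parts for n commas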
So let's say I have a 3-item list:
myString = "prop zebra cool"
items = myString.split(" ")
#items = ["prop", "zebra", "cool"]
And another list, content, containing hundreds of string items. It's actually a list of files.
Now I want to get only the items of content that contain all of the items in items.
So I started this way:
assets = []
for c in content:
    for item in items:
        if item in c:
            assets.append(c)
And then somehow isolate only the items that are duplicated in the assets list.
This would work fine, but I don't like it; it's not elegant, and I'm sure there is some better way to deal with this in Python.
If I interpret your question correctly, you can use all().
In your case, assuming:
content = [
"z:/prop/zebra/rig/cool_v001.ma",
"sjasdjaskkk",
"thisIsNoGood",
"shakalaka",
"z:/prop/zebra/rig/cool_v999.ma"
]
string = "prop zebra cool"
You can do the following:
assets = []
matchlist = string.split(' ')
for c in content:
    if all(s in c for s in matchlist):
        assets.append(c)

print(assets)
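If you prefer a one-liner, the same filter can be written as a list comprehension:
assets = [c for c in content if all(s in c for s in matchlist)]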
Alternative Method
If you want to have more control (i.e. you want to make sure that you only match strings where your words appear in the specified order), then you could go with regular expressions:
import re
# convert content to a single, tab-separated, string
contentstring = '\t'.join(content)
# generate a regex string to match
matchlist = [r'(?:{0})[^\t]+'.format(s) for s in string.split(' ')]
matchstring = r'([^\t]+{0})'.format(''.join(matchlist))
assets = re.findall(matchstring, contentstring)
print(assets)
Assuming \t does not appear in the strings of content, you can use it as a separator and join the list into a single string (obviously, you can pick any other separator that better suits you).
Then you can build your regex so that it matches any substring containing your words and any other character, except \t.
In this case, matchstring results in:
([^\t]+(?:prop)[^\t]+(?:zebra)[^\t]+(?:cool)[^\t]+)
where:
(?:word) means that word is matched but not returned
[^\t]+ means that all characters but \t will match
the outer () will return whole strings matching your rule (in this case z:/prop/zebra/rig/cool_v001.ma and z:/prop/zebra/rig/cool_v999.ma)
I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and creating a new column. My question is, is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence, the use of split(...)[0]. However, it would be great if the unwanted strings could be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern would not be preferable, as there are tens of patterns which change on a regular basis.
import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

patterns = [
    '( regex1 \d+)',
    '((?: regex 2)? \d{1,2} )',
    '( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
Thank you for your help
There is indeed, and it is called df.applymap(some_function).
Consider the following example:
from pandas import DataFrame
import pandas as pd, re
df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})
def cleanitup(val):
    """ Multiplies digit values """
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val
# here is where the magic starts
df.applymap(cleanitup)
Obviously, this is a made-up example, but every cell that previously contained only digits has now been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rearrange your values if necessary in the function cleanitup().
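If only the text column needs cleaning, the same idea works with Series.apply instead of applymap. A rough sketch, reusing the patterns list and the text column from your question (just to illustrate the approach, not a drop-in implementation):
import re

def strip_patterns(val):
    """Cut the text at the start of the first match of any unwanted pattern."""
    for p in patterns:
        m = re.search(p, val, flags=re.IGNORECASE)
        if m:
            val = val[:m.start()]
    return val

df['text'] = df['text'].apply(strip_patterns)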
I'm trying to find any occurrences of a character repeating more than 2 times in a user entered string. I have this, but it doesn't go into the if statement.
password = asDFwe23df333
s = re.compile('((\w)\2{2,})')
m = s.search(password)
if m:
    print("Password cannot contain 3 or more of the same characters in a row\n")
    sys.exit(0)
You need to prefix your regex with the letter 'r', like so:
s = re.compile(r'((\w)\2{2,})')
If you don't do that, then you'll have to double up on all your backslashes, since Python normally treats backslashes as escape characters in its regular strings. Since that makes regexes even harder to read than they already are, most regexes in Python include that prefix.
Also, in your included code your password isn't in quotes, but I'm assuming it has quotes in your code.
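A quick way to see the difference (assuming the password is quoted, as above):
import re

password = "asDFwe23df333"

bad = re.compile('((\w)\2{2,})')    # without the r prefix, '\2' becomes the control character \x02, so the backreference is lost
good = re.compile(r'((\w)\2{2,})')  # the raw string keeps \2 as a backreference

print(bad.search(password))   # None
print(good.search(password))  # matches '333'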
Couldn't you simply go through the whole string and, every time you find a character equal to the previous one, increment a counter until it reaches 3? If the character differs from the previous one, just set the counter back to 0.
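A rough sketch of that counting approach:
def has_triple_repeat(password):
    count = 1
    for prev, cur in zip(password, password[1:]):
        if cur == prev:
            count += 1
            if count >= 3:
                return True   # found 3 (or more) identical characters in a row
        else:
            count = 1
    return False

print(has_triple_repeat("asDFwe23df333"))  # True, because of '333'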
EDIT:
Or, you can use:
s = 'aaabbb'
re.findall(r'((\w)\2{2,})', s)
And check if the list returned by the second line has any elements.