Series regex extract producing a dataframe - regex

I am working through a regex task on Dataquest. The following code snippet runs correctly
inside of the Dataquest IDE:
titles = hn["title"]
pattern = r'\[(\w+)\]'
tag_matches = titles.str.extract(pattern)
tag_freq = tag_matches.value_counts()
print(tag_freq, '\n')
However, on my PC running pandas 0.25.3 this exact same code block yields an error:
Traceback (most recent call last):
File "C:/Users/Mark/PycharmProjects/main/main.py", line 63, in <module>
tag_freq = tag_matches.value_counts()
File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'value_counts'
Why is tag_matches coming back as a dataframe? I am running an extract against the series 'titles'.

From the docs:
Pandas.Series.str.Extract
A pattern with one group will return a Series if expand=False.
>>> s.str.extract(r'[ab](\d)', expand=False)
0 1
1 2
2 NaN
dtype: object
So perhaps you must be explicit and set expand=False to get a series object?

Related

Trying to search excel file for a date

Im searching for a date in excel using a string that I've converted to python date. I get a error trying to convert excel values to date using the following code:
from dateutil import parser
import xlrd
d = '4/8/2019'
dt_obj = parser.parse(d)
wbpath = 'XLSX FILE'
wb = xlrd.open_workbook(wbpath)
ws = wb.sheet_by_index(1)
for rowidx in range(ws.nrows):
row = ws.row(rowidx)
for colidx, cell in enumerate(row):
if xlrd.xldate_as_tuple(cell.value, wb.datemode) == dt_obj:
print(ws.name)
print(colidx)
print(rowidx)
ERROR I get:
Traceback (most recent call last):
File "C:/Users/DKisialeu/PycharmProjects/new/YIM.py", line 12, in <module>
if xlrd.xldate_as_tuple(cell.value, wb.datemode) == dt_obj:
File "C:\Users\DKisialeu\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\xldate.py", line 95, in xldate_as_tuple
if xldate < 0.00:
TypeError: '<' not supported between instances of 'str' and 'float'
Make sure that the dates in your excel spreadsheet are formatted as dates and not as text.
I get the same error by running your code with a spreadsheet with any text formatted cells at all.

ValueError Converting UTC time to a desired format and time zone python

I am trying to convert UTC time to a normal format and timezone. The docs are making me throw toys!! Can someone please write me a quick simple example. My code in python;
m.startAt = datetime.strptime(r['StartAt'], '%d/%m/%Y %H:%M')
Error
ValueError: time data '2016-10-28T12:42:59.389Z' does not match format '%d/%m/%Y %H:%M:'
For datetime.strptime to work you need to specify a formatting string appropriately matching the string you're parsing from. The error indicates you don't - so parsing fails. See strftime() and strptime() Behavior for the formatting arguments.
The string you get is indicated in the error message: '2016-10-28T12:42:59.389Z' (a Z/Zulu/ISO 8601 datetime string).
The matching string for that would be '%Y-%m-%dT%H:%M:%S.%f%z' or, after dropping the final Z from the string, '%Y-%m-%dT%H:%M:%S.%f'.
A bit tricky is the final Z in the string, which can be parsed by a %z, but which may not be supported in the GAE-supported python version (in my 2.7.12 it's not supported):
>>> datetime.strptime('2016-10-28T12:42:59.389', '%Y-%m-%dT%H:%M:%S.%f%z')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/_strptime.py", line 324, in _strptime
(bad_directive, format))
ValueError: 'z' is a bad directive in format '%Y-%m-%dT%H:%M:%S.%f%z'
So I stripped Z and used the other format:
>>> stripped_z = '2016-10-28T12:42:59.389Z'[:-1]
>>> stripped_z
'2016-10-28T12:42:59.389'
>>> that_datetime = datetime.strptime(stripped_z, '%Y-%m-%dT%H:%M:%S.%f')
>>> that_datetime
datetime.datetime(2016, 10, 28, 12, 42, 59, 389000)
To obtain a string use strftime:
>>> that_datetime.strftime('%d/%m/%Y %H:%M')
'28/10/2016 12:42'
It'll be more complicated if you want to use a timezone, but my recommendation is to stick with UTC on the backend storage and leave timezone conversion for the frontend/client side.
You might want to use a DateTimeProperty to store the value, in which case into you can write it directly:
entity.datetime_property = that_datetime
That error is telling you the problem, the format string has to match the datetime string you gave it.
For example:
x = datetime.strptime("2016-6-9 08:57", "%Y-%m-%d %H:%M")
Notice the second string matches the format of the first one.
Your time string looks like this:
2016-10-28T12:42:59.389Z
Which does not match your format string.

Pandas concatenating dataframes raises ValueError

I am looping through many smaller dataframes and concatenating them into a single dataframe using pandas.concat(). In the middle of the looping an exception is raised with message ValueError: Plan shapes are not aligned.
The failed dataframe contains a single row (like all the previous dataframes) and the columns are a subset of the other dataframe. A sample snippet of the code is below.
import pandas as pd
df, failed = pd.DataFrame(), pd.DataFrame()
for _file in os.listdir(file_dir):
_tmp = pd.read_csv(file_dir + _file)
try:
df= pd.concat([df, _tmp])
except ValueError as e:
if 'Plan shapes are not aligned' in str(e):
failed = pd.concat([failed, _tmp])
print [x for x in failed.columns if x not in df.columns]
print len(df), len(failed)
And I end up with the result
Out[10]: []
118 1
Checking the failures it is always the same dataframe, so the dataframe must be the problem. Printing out the dataframe I get
0 timestamp actual average_estimate median_estimate \
0 1996-11-14 01:30:00 2.300000 2.380000 2.400000
0 estimate1 estimate2 estimate3 estimate4 \
0 2.400000 2.200000 2.500000 2.600000
0 estimate5
0 2.200000
Which has a similar format to the other concatenated dataframes and the df dataframe. Is there something that I'm missing?
Extra info: I am using pandas 0.16.0
Edit: full stack trace below with modifications for anonymity
Traceback (most recent call last):
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-48539cb93d64>", line 37, in <module>
df = pd.concat([df, _tmp])
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 755, in concat
return op.get_result()
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 926, in get_result
mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4040, in concatenate_block_managers
for placement, join_units in concat_plan]
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4258, in combine_concat_plans
raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
Edit 2: Tried with 0.17.1 and 0.18.0 and still have the same error.

pandas .drop() memory error large file

For reference, this is all on a Windows 7 x64 bit machine in PyCharm Educational Edition 1.0.1, with Python 3.4.2 and Pandas 0.16.1
I have an ~791MB .csv file with ~3.04 million rows x 24 columns. The file contains liquor sales data for the state of Iowa from January 2014 to February 2015. If you are interested, the file can be found here: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy.
One of the columns, titled store location, holds the address including latitude and longitude. The purpose of the program below is to take the latitude and longitude out of the store location cell and place each in its own cell. When the file is cut down to ~1.04 million rows, my program works properly.
1 import pandas as pd
2
3 #import the original file
4 sales = pd.read_csv('Iowa_Liquor_Sales.csv', header=0)
5
6 #transfer the copies into lists
7 lat = sales['STORE LOCATION']
8 lon = sales['STORE LOCATION']
9
10 #separate the latitude and longitude from each cell into their own list
11 hold = [i.split('(', 1)[1] for i in lat]
12 lat2 = [i.split(',', 1)[0] for i in hold]
13 lon2 = [i.split(',', 1)[1] for i in hold]
14 lon2 = [i.split(')', 1)[0] for i in lon2]
15
16 #put the now separate latitude and longitude back into their own columns
17 sales['LATITUDE'] = lat2
18 sales['LONGITUDE'] = lon2
19
20 #drop the store location column
21 sales = sales.drop(['STORE LOCATION'], axis=1)
22
23 #export the new panda data frame into a new file
24 sales.to_csv('liquor_data2.csv')
However, when I try to run the code with the full 3.04 million line file, it gives me this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1595, in drop
dropped = self.reindex(**{axis_name: new_axis})
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2505, in reindex
**kwargs)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1751, in reindex
self._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2132, in _consolidate_inplace
self._data = self._protect_consolidate(f)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2125, in _protect_consolidate
result = f()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2131, in <lambda>
f = lambda: self._data.consolidate()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2833, in consolidate
bm._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2838, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3817, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3840, in _merge_blocks
new_values = _vstack([b.values for b in blocks], dtype)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3870, in _vstack
return np.vstack(to_stack)
File "C:\Python34\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
I tried running the code line-by-line in the python console and found that the error occurs after the program runs the sales = sales.drop(['STORE LOCATION'], axis=1) line.
I have searched for similar issues elsewhere and the only answer I have come up with is chunking the file as it is read by the program, like this:
#import the original file
df = pd.read_csv('Iowa_Liquor_Sales7.csv', header=0, chunksize=chunksize)
sales = pd.concat(df, ignore_index=True)
My only problem with that is then I get this error:
Traceback (most recent call last):
File "C:/Users/Aaron/PycharmProjects/DATA/Liquor_Reasign_Pd.py", line 14, in <module>
lat = sales['STORE LOCATION']
TypeError: 'TextFileReader' object is not subscriptable
My google-foo is all foo'd out. Anyone know what to do?
UPDATE
I should specify that with the chunking method,the error comes about when the program tries to duplicate the store location column.
So I found an answer to my issue. I ran the program in python 2.7 instead of python 3.4. The only change I made was deleting line 8, as it is unused. I don't know if 2.7 just handles the memory issue differently, or if I had improperly installed the pandas package in 3.4. I will reinstall pandas in 3.4 to see if that was the problem, but if anyone else has a similar issue, try your program in 2.7.
UPDATE Realized that I was running 32 bit python on a 64 bit machine. I upgraded my versions of python and it runs without memory errors now.

else statement does not return to loop

I have a code that opens a file, calculates the median value and writes that value to a separate file. Some of the files maybe empty so I wrote the following loop to check it the file is empty and if so skip it, increment the count and go back to the loop. It does what is expected for the first empty file it finds ,but not the second. The loop is below
t = 15.2
while t>=11.4:
if os.stat(r'C:\Users\Khary\Documents\bin%.2f.txt'%t ).st_size > 0:
print("All good")
F= r'C:\Users\Documents\bin%.2f.txt'%t
print(t)
F= np.loadtxt(F,skiprows=0)
LogMass = F[:,0]
LogRed = F[:,1]
value = np.median(LogMass)
filesave(*find_nearest(LogMass,LogRed))
t -=0.2
else:
t -=0.2
print("empty file")
The output is as follows
All good
15.2
All good
15.0
All good
14.8
All good
14.600000000000001
All good
14.400000000000002
All good
14.200000000000003
All good
14.000000000000004
All good
13.800000000000004
All good
13.600000000000005
All good
13.400000000000006
empty file
All good
13.000000000000007
Traceback (most recent call last):
File "C:\Users\Documents\Codes\Calculate Bin Median.py", line 35, in <module>
LogMass = F[:,0]
IndexError: too many indices
A second issue is that t somehow goes from one decimal place to 15 and the last place seems to incrementing whats with that?
Thanks for any and all help
EDIT
The error IndexError: too many indices only seems to apply to files with only one line example...
12.9982324 0.004321374
If I add a second line I no longer get the error can someone explain why this is? Thanks
EDIT
I tried a little experiment and it seems numpy does not like extracting a column if the array only has one row.
In [8]: x = np.array([1,3])
In [9]: y=x[:,0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-50e27cf81d21> in <module>()
----> 1 y=x[:,0]
IndexError: too many indices
In [10]: y=x[:,0].shape
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-10-e8108cf30e9a> in <module>()
----> 1 y=x[:,0].shape
IndexError: too many indices
In [11]:
You should be using try/except blocks. Something like:
t = 15.2
while t >= 11.4:
F= r'C:\Users\Documents\bin%.2f.txt'%t
try:
F = np.loadtxt(F,skiprows=0)
LogMass = F[:,0]
LogRed = F[:,1]
value = np.median(LogMass)
filesave(*find_nearest(LogMass,LogRed))
except IndexError:
print("bad file: {}".format(F))
else:
print("file worked!")
finally:
t -=0.2
Please refer to the official tutorial for more details about exception handling.
The issue with the last digit is due to how floats work they can not represent base10 numbers exactly. This can lead to fun things like:
In [13]: .3 * 3 - .9
Out[13]: -1.1102230246251565e-16
To deal with the one line file case, add the ndmin parameter to np.loadtxt (review its doc):
np.loadtxt('test.npy',ndmin=2)
# array([[ 1., 2.]])
With the help of a user named ajcr, found the problem was that ndim=2 should have been used in numpy.loadtxt() to insure that the array always 2 has dimensions.
Python uses indentation to define if while and for blocks.
It doesn't look like your if else statement is fully indented from the while.
I usually use a full 'tab' keyboard key to indent instead of 'spaces'