Using the IPython console, I built a pandas DataFrame called df.
for (k1, k2), group in df.groupby(['II', 'time']):
    print k1, k2
    print group
df['II'] stores integers in the range [-10, 10].
df['time'] is either 930 or 1620.
My goal is to save the output of this loop to a single .csv file. (Not great, but I copied and pasted the output into a csv.) In doing so, I noticed that the groups for II == -1, at both times (930 and 1620), do not appear in full data view like the others, although both groups do exist.
For example, the group for II == -1 at time 930 appears in the console as:
-1 930
<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 2 to 2140
Data columns:
index 268 non-null values
date 268 non-null values
time 268 non-null values
price 268 non-null values
round5 268 non-null values
II 268 non-null values
Pattern 268 non-null values
pl 268 non-null values
dtypes: float64(2), int64(4), object(2)
Knowing the group exists, I tried brute force, pulling it out manually:
u = df['II'] == -1
one = df.groupby('time')[u]
# To check the result:
one.to_csv('file.csv')
I'm grouping by 'time', so both times should appear. Yet the resulting csv contains only the 1620 rows; all results at 930 are missing. It's bizarre. Your suggestions are greatly appreciated.
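For reference, what I am ultimately after is roughly this, a minimal sketch that stitches every group into one file (all_groups.csv is just a placeholder name):
import pandas as pd

pieces = []
for (k1, k2), group in df.groupby(['II', 'time']):
    pieces.append(group)  # collect every (II, time) group, including II == -1
pd.concat(pieces).to_csv('all_groups.csv')  # one csv with all groups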
I am using the Python API of SAS, and have uploaded a table with:
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
I can see the details of the table with s.tableinfo().
§ TableInfo
Name Rows Columns IndexedColumns Encoding CreateTimeFormatted ModTimeFormatted AccessTimeFormatted JavaCharSet CreateTime ... Repeated View MultiPart SourceName SourceCaslib Compressed Creator Modifier SourceModTimeFormatted SourceModTime
0 HMEQ 5960 13 0 utf-8 2020-02-10T16:48:02-05:00 2020-02-10T16:48:02-05:00 2020-02-10T21:10:34-05:00 UTF8 1.896990e+09 ... 0 0 0 0 aforoo 2020-02-10T16:48:02-05:00 1.896990e+09
1 rows × 23 columns
But I cannot access any value of the table in Python. For example, suppose I want to get the number of rows and columns as Python scalars. I know that I can get SAS tables into pandas DataFrames by using pd.DataFrame, but it does not work for this table and I get:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
346 dtype=dtype, copy=copy)
347 elif isinstance(data, dict):
--> 348 mgr = self._init_dict(data, index, columns, dtype=dtype)
349 elif isinstance(data, ma.MaskedArray):
350 import numpy.ma.mrecords as mrecords
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
457 arrays = [data[k] for k in keys]
458
--> 459 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
460
461 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
7354 # figure out the index, if necessary
7355 if index is None:
-> 7356 index = extract_index(arrays)
7357
7358 # don't force copy because getting jammed in an ndarray anyway
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in extract_index(data)
7391
7392 if not indexes and not raw_lengths:
-> 7393 raise ValueError('If using all scalar values, you must pass'
7394 ' an index')
7395
ValueError: If using all scalar values, you must pass an index
I have the same issue with any other casout table in SAS. I appreciate any help or comments.
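Roughly, what I am after is something like this (only a sketch; it assumes the result of s.tableinfo() can be indexed by the 'TableInfo' key shown above, with the counts in its 'Rows' and 'Columns' columns):
info = s.tableinfo()['TableInfo']  # assumed: results are dict-like, keyed by the 'TableInfo' name shown above
nrows = int(info['Rows'][0])       # 5960
ncols = int(info['Columns'][0])    # 13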
I would suggest using pandas directly to read from SAS.
Reference from another answer: Read SAS file with pandas
Here is another example
https://www.marsja.se/how-to-read-sas-files-in-python-with-pandas/
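For completeness, a minimal sketch of that approach (the file name hmeq.sas7bdat is only a placeholder; this assumes the data is available as a local SAS dataset rather than the in-memory CAS table):
import pandas as pd

# read a SAS dataset straight into a DataFrame (placeholder file name)
df = pd.read_sas('hmeq.sas7bdat', format='sas7bdat')
print(df.shape)  # (rows, columns) as plain Python ints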
I found the solution below and it works fine. For example, here I have used the dataSciencePilot.exploreData action, and I can get the results with:
casout = dict(name = 'out1', replace=True)
s.dataSciencePilot.exploreData(table=tbl_name, target='bad', casout=casout)
fetch_opts = dict(maxrows=100000000, to=1000000)
df = s.fetch(table='out1', **fetch_opts)['Fetch']
features = pd.DataFrame(df)
type(features)
which returns pandas.core.frame.DataFrame.
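From there, the row and column counts are available as plain Python scalars, e.g. (assuming the fetch above succeeded):
rows, cols = features.shape
print(rows, cols)  # plain Python ints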
I am trying to extract data from a k6 output (https://docs.k6.io/docs/results-output):
data_received.........: 246 kB 21 kB/s
data_sent.............: 174 kB 15 kB/s
http_req_blocked......: avg=26.24ms min=0s med=13.5ms max=145.27ms p(90)=61.04ms p(95)=70.04ms
http_req_connecting...: avg=23.96ms min=0s med=12ms max=145.27ms p(90)=57.03ms p(95)=66.04ms
http_req_duration.....: avg=197.41ms min=70.32ms med=91.56ms max=619.44ms p(90)=288.2ms p(95)=326.23ms
http_req_receiving....: avg=141.82µs min=0s med=0s max=1ms p(90)=1ms p(95)=1ms
http_req_sending......: avg=8.15ms min=0s med=0s max=334.23ms p(90)=1ms p(95)=1ms
http_req_waiting......: avg=189.12ms min=70.04ms med=91.06ms max=343.42ms p(90)=282.2ms p(95)=309.22ms
http_reqs.............: 190 16.054553/s
iterations............: 5 0.422488/s
vus...................: 200 min=200 max=200
vus_max...............: 200 min=200 max=200
The data comes in the format above, and I am trying to find a way to extract each metric name along with only its values. For example:
http_req_duration: 197.41ms, 70.32ms, 91.56ms, 619.44ms, 288.2ms, 326.23ms
I have to do this for ~50-100 files and would like a regex or a similarly quick way to do it, without writing too much code. Is that possible?
Here's a simple Python solution:
import re

FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL)  # split each line into name:value
VALUES = re.compile(r"(?<==).*?(?=\s|$)")  # match the individual values in http_req_* fields

# open the input file `k6_input.log` for reading and `k6_parsed.log` for writing
with open("k6_input.log", "r") as f_in, open("k6_parsed.log", "w") as f_out:
    for line in f_in:  # read the input file line by line
        field = FIELD.match(line)  # first match all <field_name>...:<values> fields
        if field:
            name = field.group(1)  # get the field name from the first capture group
            f_out.write(name + ": ")  # write the field name to the output file
            value = field.group(2)  # get the field value from the second capture group
            if name[:9] == "http_req_":  # parse out only the http_req_* fields
                f_out.write(", ".join(VALUES.findall(value)) + "\n")  # extract the values
            else:  # verbatim copy of other fields
                f_out.write(value)
        else:  # encountered an unrecognizable field, just copy the line
            f_out.write(line)
For a file with the contents above, you'll get the following result:
data_received: 246 kB 21 kB/s
data_sent: 174 kB 15 kB/s
http_req_blocked: 26.24ms, 0s, 13.5ms, 145.27ms, 61.04ms, 70.04ms
http_req_connecting: 23.96ms, 0s, 12ms, 145.27ms, 57.03ms, 66.04ms
http_req_duration: 197.41ms, 70.32ms, 91.56ms, 619.44ms, 288.2ms, 326.23ms
http_req_receiving: 141.82µs, 0s, 0s, 1ms, 1ms, 1ms
http_req_sending: 8.15ms, 0s, 0s, 334.23ms, 1ms, 1ms
http_req_waiting: 189.12ms, 70.04ms, 91.06ms, 343.42ms, 282.2ms, 309.22ms
http_reqs: 190 16.054553/s
iterations: 5 0.422488/s
vus: 200 min=200 max=200
vus_max: 200 min=200 max=200
If you have to run it over many files, I'd suggest you investigate glob.glob(), os.walk(), or os.listdir() to list all the files you need, then loop over them and run the above, thus further automating the process.
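For example, something along these lines (a sketch; parse_k6_file is a hypothetical wrapper around the parsing loop above, and the k6_*.log pattern is only a guess at your file naming):
import glob

for path in glob.glob("k6_*.log"):                  # every matching k6 log in the current directory
    out_path = path.replace(".log", "_parsed.log")  # derive an output name per input file
    parse_k6_file(path, out_path)                   # hypothetical wrapper around the loop above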
I am referring to a question I posted a few days ago. I haven't gotten any replies yet, and I suspect the situation was not properly described, so I made a simpler set-up that should be easier to understand and, hopefully, will get more attention from experienced programmers!
I forgot to mention: I am running Python 2 on Jupyter.
import pandas as pd
from pandas import Series, DataFrame
g_input_df = pd.read_csv('SetsLoc.csv')
URL=g_input_df.iloc[0,0]
c_input_df = pd.read_csv(URL)
c_input_df = c_input_df.set_index("Parameter")
root_path = c_input_df.loc["root_1"]
input_rel_path = c_input_df.loc["root_2"]
input_file_name = c_input_df.loc["file_name"]
This section reads a list of paths from a .csv, one at a time; each path points to another .csv file that contains the input for a simulation to be set up using Python.
The results from the above code can be tested here:
c_input_df
                    Value
Parameter
root_1     C:/SimpleTest/
root_2             Input/
file_name      Prop_1.csv

URL
'C:/SimpleTest/Sets/Set_1.csv'

root_path+input_rel_path+input_file_name
Value    C:/SimpleTest/Input/Prop_1.csv
dtype: object
Property_1 = pd.read_csv('C:/SimpleTest/Input/Prop_1.csv')
Property_1
height weight
0 100 50
1 110 44
2 98 42
...on the other hand, if I try to use variables to build the file's path and name, I get an error:
Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
Property_1
I get the following error:
ValueErrorTraceback (most recent call last)
<ipython-input-3-1d5306b6bdb5> in <module>()
----> 1 Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
2 Property_1
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(
--> 392 filepath_or_buffer, encoding, compression)
393 kwds['compression'] = compression
394
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
208 if not is_file_like(filepath_or_buffer):
209 msg = "Invalid file path or buffer object type: {_type}"
--> 210 raise ValueError(msg.format(_type=type(filepath_or_buffer)))
211
212 return filepath_or_buffer, None, compression
ValueError: Invalid file path or buffer object type: <class 'pandas.core.series.Series'>
I believe the problem resides in the way the parameters that make up the path and filename are read from the dataframe. Is there any way to specify that those parameters are paths, or something similar that will avoid this problem?
Any help is highly appreciated!
I posted the solution in the other question related to this post, in case someone wants to take a look:
Problems opening a path in Python
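In short, the error happens because c_input_df.loc["root_1"] returns a Series, so the concatenated path is itself a Series rather than a string. A minimal sketch of the kind of fix (it assumes the 'Value' column name shown in the output above):
# pull plain strings out of the frame instead of one-row Series objects
root_path = c_input_df.loc["root_1", "Value"]
input_rel_path = c_input_df.loc["root_2", "Value"]
input_file_name = c_input_df.loc["file_name", "Value"]

Property_1 = pd.read_csv(root_path + input_rel_path + input_file_name)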
I am able to load a csv file into a pandas DataFrame and print the table, as seen below. However, when I try to print the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df         # this prints the table as seen below
print df.Height  # AttributeError: 'DataFrame' object has no attribute 'Height'
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name = df.columns[0]
df = df.rename(columns={col_name: 'new_name'})
The error in my case was caused (I think) by a byte order mark in the csv, or some other non-printing character being added to the first column label. df.columns returns an array of the column names, and df.columns[0] gets the first one. Try printing it and see if something looks odd in the result.
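A quick way to check and clean this up (a sketch; the path is the one from the question, and utf-8-sig simply strips a UTF-8 BOM if one is present):
import pandas as pd

df = pd.read_csv('/path../NavieBayes.csv', encoding='utf-8-sig')  # drops a UTF-8 BOM if present
print(repr(df.columns[0]))  # any hidden BOM bytes or leading spaces in the label show up here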
PS On the above answer by JAB: if there are clearly spaces in your column names, use skipinitialspace=True in read_csv, e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
Pass the path to read_csv as a raw string (for Windows backslashes), or use forward slashes:
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
or
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
Try it; it should work.
I am working with a pandas DataFrame. I am interested in obtaining a new DataFrame based on a condition applied to a column of an already existing DataFrame. Here is the DataFrame:
users_df
Out[30]:
<class 'pandas.core.frame.DataFrame'>
Index: 3595 entries,
Data columns (total 9 columns):
screen_name 3595 non-null values
User_Desc 3595 non-null values
lang 3595 non-null values
followers_count 3579 non-null values
friends_count 3580 non-null values
listed_count 2665 non-null values
statuses_count 3595 non-null values
stem_key_flag 3595 non-null values
stem_keys 3595 non-null values
dtypes: bool(1), float64(3), int64(1), object(4)
What I am doing is
en_users_df = users_df[users_df['stem_key_flag']==True]
but I get exactly the same summary as in the top code block, which means it is not filtering anything. Am I doing something that was compatible in an earlier version but not now? If not, what is the mistake I am making?
I also tried a similar approach with another column, which has an int data type, and it works fine.
fol_cnt_users_df = users_df[users_df['followers_count'] >1000]
In [35]: fol_cnt_users_df
Out[35]:
<class 'pandas.core.frame.DataFrame'>
Index: 724 entries, 2013-06-20, 12:13:46 to 2013-06-19, 18:26:48
Data columns (total 9 columns):
screen_name 724 non-null values
User_Desc 724 non-null values
lang 724 non-null values
followers_count 724 non-null values
friends_count 722 non-null values
listed_count 714 non-null values
statuses_count 724 non-null values
stem_key_flag 724 non-null values
stem_keys 724 non-null values
dtypes: bool(1), float64(3), int64(1), object(4)
Thanks for the help in advance.
Your problem is likely a version issue (I assume you are using either 0.10 or 0.11). I've tested your code, and if the stem_key_flag column contains any False values, it should return a different DataFrame. However, since this thread has become moderately popular, for the sake of future visitors I would like to state that your filtering line (noted below) is correct:
en_users_df = users_df[users_df['stem_key_flag']==True]
Nonetheless, you will achieve identical results with a simpler line such as
en_users_df = users_df[users_df.stem_key_flag]
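To illustrate on a tiny made-up frame (not your data):
import pandas as pd

users_df = pd.DataFrame({'screen_name': ['a', 'b', 'c'],
                         'stem_key_flag': [True, False, True]})
en_users_df = users_df[users_df.stem_key_flag]
print(len(en_users_df))  # 2 -- the False row is dropped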