Problems importing strings to form a path from a .csv file - python-2.7

I am referring to a question I posted a few days ago. I haven't received any replies yet, and I suspect the situation was not properly described, so I have put together a simpler set-up that should be easier to understand and, hopefully, get more attention from experienced programmers.
I forgot to mention that I am running Python 2 on Jupyter.
import pandas as pd
from pandas import Series, DataFrame
g_input_df = pd.read_csv('SetsLoc.csv')
URL=g_input_df.iloc[0,0]
c_input_df = pd.read_csv(URL)
c_input_df = c_input_df.set_index("Parameter")
root_path = c_input_df.loc["root_1"]
input_rel_path = c_input_df.loc["root_2"]
input_file_name = c_input_df.loc["file_name"]
This section reads a list of paths from a .csv file, one at a time; each path points to another .csv file that contains the input for a simulation to be set up using Python.
The results from the above code can be tested here:
c_input_df
                    Value
Parameter
root_1     C:/SimpleTest/
root_2             Input/
file_name      Prop_1.csv
URL
'C:/SimpleTest/Sets/Set_1.csv'
root_path+input_rel_path+input_file_name
Value C:/SimpleTest/Input/Prop_1.csv
dtype: object
Property_1 = pd.read_csv('C:/SimpleTest/Input/Prop_1.csv')
Property_1
height weight
0 100 50
1 110 44
2 98 42
...on the other hand, if I try to use variables to build the file's path and name, I get an error:
Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
Property_1
I get the following error:
ValueErrorTraceback (most recent call last)
<ipython-input-3-1d5306b6bdb5> in <module>()
----> 1 Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
2 Property_1
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(
--> 392 filepath_or_buffer, encoding, compression)
393 kwds['compression'] = compression
394
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
208 if not is_file_like(filepath_or_buffer):
209 msg = "Invalid file path or buffer object type: {_type}"
--> 210 raise ValueError(msg.format(_type=type(filepath_or_buffer)))
211
212 return filepath_or_buffer, None, compression
ValueError: Invalid file path or buffer object type: <class 'pandas.core.series.Series'>
I believe that the problem lies in the way the parameters that make up the path and file name are read from the dataframe. Is there any way to specify that those parameters are paths, or something similar that would avoid this problem?
Any help is highly appreciated!
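A quick, illustrative check (not part of the original post) shows what pandas is actually returning here: each .loc lookup yields a Series, and so does their concatenation, which is exactly the type the traceback complains about:
print type(root_path)                                      # <class 'pandas.core.series.Series'>
print type(root_path + input_rel_path + input_file_name)   # still a Series, not a str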

I posted the solution in the other question related to this post, in case someone wants to take a look:
Problems opening a path in Python
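For completeness, a minimal sketch of the kind of fix that resolves this error (the linked question has the full solution): select scalar strings from the "Value" column, rather than whole rows, before concatenating:
root_path = c_input_df.loc["root_1", "Value"]
input_rel_path = c_input_df.loc["root_2", "Value"]
input_file_name = c_input_df.loc["file_name", "Value"]
Property_1 = pd.read_csv(root_path + input_rel_path + input_file_name)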

Related

Parse output from k6 data to get specific information

I am trying to extract data from a k6 output (https://docs.k6.io/docs/results-output):
data_received.........: 246 kB 21 kB/s
data_sent.............: 174 kB 15 kB/s
http_req_blocked......: avg=26.24ms min=0s med=13.5ms max=145.27ms p(90)=61.04ms p(95)=70.04ms
http_req_connecting...: avg=23.96ms min=0s med=12ms max=145.27ms p(90)=57.03ms p(95)=66.04ms
http_req_duration.....: avg=197.41ms min=70.32ms med=91.56ms max=619.44ms p(90)=288.2ms p(95)=326.23ms
http_req_receiving....: avg=141.82µs min=0s med=0s max=1ms p(90)=1ms p(95)=1ms
http_req_sending......: avg=8.15ms min=0s med=0s max=334.23ms p(90)=1ms p(95)=1ms
http_req_waiting......: avg=189.12ms min=70.04ms med=91.06ms max=343.42ms p(90)=282.2ms p(95)=309.22ms
http_reqs.............: 190 16.054553/s
iterations............: 5 0.422488/s
vus...................: 200 min=200 max=200
vus_max...............: 200 min=200 max=200
The data comes in the format above, and I am trying to find a way to extract each field name along with its values only. As an example:
http_req_duration: 197.41ms, 70.32ms,91.56ms, 619.44ms, 288.2ms, 326.23ms
I have to do this for ~50-100 files and want to find a regex or similarly quick way to do it, without writing too much code. Is it possible?
Here's a simple Python solution:
import re

FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL)  # split the line into name:value
VALUES = re.compile(r"(?<==).*?(?=\s|$)")  # match individual values from http_req_* fields

# open the input file `k6_input.log` for reading, and `k6_parsed.log` for writing
with open("k6_input.log", "r") as f_in, open("k6_parsed.log", "w") as f_out:
    for line in f_in:  # read the input file line by line
        field = FIELD.match(line)  # first match all <field_name>...:<values> fields
        if field:
            name = field.group(1)  # get the field name from the first capture group
            f_out.write(name + ": ")  # write the field name to the output file
            value = field.group(2)  # get the field value from the second capture group
            if name[:9] == "http_req_":  # parse out only http_req_* fields
                f_out.write(", ".join(VALUES.findall(value)) + "\n")  # extract the values
            else:  # verbatim copy of other fields
                f_out.write(value)
        else:  # encountered an unrecognizable field, just copy the line
            f_out.write(line)
For a file with the contents shown above, you'll get the following result:
data_received: 246 kB 21 kB/s
data_sent: 174 kB 15 kB/s
http_req_blocked: 26.24ms, 0s, 13.5ms, 145.27ms, 61.04ms, 70.04ms
http_req_connecting: 23.96ms, 0s, 12ms, 145.27ms, 57.03ms, 66.04ms
http_req_duration: 197.41ms, 70.32ms, 91.56ms, 619.44ms, 288.2ms, 326.23ms
http_req_receiving: 141.82µs, 0s, 0s, 1ms, 1ms, 1ms
http_req_sending: 8.15ms, 0s, 0s, 334.23ms, 1ms, 1ms
http_req_waiting: 189.12ms, 70.04ms, 91.06ms, 343.42ms, 282.2ms, 309.22ms
http_reqs: 190 16.054553/s
iterations: 5 0.422488/s
vus: 200 min=200 max=200
vus_max: 200 min=200 max=200
If you have to run it over many files, I'd suggest looking into glob.glob(), os.walk(), or os.listdir() to list all the files you need, then looping over them and executing the above, thus further automating the process.
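As a rough sketch of that automation (the file-name pattern and output naming below are assumptions; adjust them to your layout), the same parsing logic can be wrapped in a function and run over every matching file:
import glob
import re

FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL)
VALUES = re.compile(r"(?<==).*?(?=\s|$)")

def parse_k6_log(in_path, out_path):
    # identical logic to the loop above, wrapped for reuse
    with open(in_path, "r") as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            field = FIELD.match(line)
            if not field:
                f_out.write(line)  # unrecognizable line, copy verbatim
                continue
            name, value = field.group(1), field.group(2)
            f_out.write(name + ": ")
            if name.startswith("http_req_"):
                f_out.write(", ".join(VALUES.findall(value)) + "\n")
            else:
                f_out.write(value)

# hypothetical naming pattern: every k6 log in the current directory
for in_path in glob.glob("k6_*.log"):
    parse_k6_log(in_path, in_path.replace(".log", "_parsed.log"))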

python pandas dataframe index, error TypeError: Input must be iterable, pandas version perhaps wrong

I'm working with the eda-explorer Python library from MIT, which allows one to import physiological data files from particular wearable biosensors. This library uses pandas DataFrames to store the physiological time series. I've been using this library in different computing set-ups. When I try to use it in my Ubuntu 15.10 environment, I get an error message I don't understand. It is related to the following function, which is instrumental in getting the data into a DataFrame and doing some initial transformations:
def loadData_E4(filepath):
    # Load data
    data = pd.DataFrame.from_csv(os.path.join(filepath,'EDA.csv'))
    data.reset_index(inplace=True)

    # Get the startTime and sample rate
    startTime = pd.to_datetime(float(data.columns.values[0]),unit="s")
    sampleRate = float(data.iloc[0][0])
    data = data[data.index!=0]
    data.index = data.index-1
This results in the following error messages:
In [1]:
run batch_edaexplorer_template.py
Classifying data for ...[my file location]...
---------------------------------------------------------------------
TypeError Traceback (most recent call last)
/...mypath/eda-explorer-master/batch_edaexplorer_template.py in <module>()
69 elif dataType=='e4':
70 print "Classifying data for " + filepath
---> 71 labels,data = classify(filepath,classifierList,pickleDirectory,lf.loadData_E4)
72 elif dataType=="misc":
73 print "Classifying data for " + filepath
/...mypath/eda-explorer-master/EDA_Artifact_Detection_Script.pyc in classify(filepath, classifierList, pickleDirectory, loadDataFunction)
225
226 # Load data
--> 227 data = loadDataFunction(filepath)
228
229 # Get pickle List and featureNames list
/...mypath/eda-explorer-master/load_files.pyc in loadData_E4(filepath)
58 sampleRate = float(data.iloc[0][0])
59 data = data[data.index!=0]
---> 60 data.index = data.index-1
61
62 # Reset the data frame assuming 4Hz samplingRate
/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in __sub__(self, other)
1161 warnings.warn("using '-' to provide set differences with Indexes is deprecated, "
1162 "use .difference()",FutureWarning)
-> 1163 return self.difference(other)
1164
1165 def __and__(self, other):
/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in difference(self, other)
1314
1315 if not hasattr(other, '__iter__'):
-> 1316 raise TypeError('Input must be iterable!')
1317
1318 if self.equals(other):
TypeError: Input must be iterable!
I don't get this error message on my Windows PC. I'm using pandas version 0.15.0 in the Ubuntu environment. Could the problem be that this particular index syntax is only allowed in newer versions of pandas? How should I correct the syntax so that it works with the older version of pandas? Or am I missing the point?
Try data.index = pd.Index(data.index.values-1) instead of data.index = data.index-1.
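For illustration, a minimal sketch with a dummy frame (not from the original answer): in pandas 0.15, subtracting from an Index is interpreted as a set difference, so doing the arithmetic on the underlying numpy array and rebuilding the Index sidesteps that code path:
import pandas as pd

data = pd.DataFrame({'eda': [0.1, 0.2, 0.3]})   # dummy data, just for the example
data.index = pd.Index(data.index.values - 1)    # shift labels on the ndarray, then rebuild the Index
print data.index                                # e.g. Int64Index([-1, 0, 1], ...)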

'numpy.int64' error thwarts my automation of querying Google Distance Matrix API. Solutions?

Goal:
To automate obtaining drive durations by querying a list (see the CSV set-up below) of ZIP codes ('Origin_Zip') to addresses ('Destination_BH') using the Google Distance Matrix API, writing the drive time (in minutes) into the "time_to_BH" row. I am using pandas to move the data between the csv and the Google Distance Matrix call. However, I am receiving the following error:
Error:
TypeError: argument of type 'numpy.int64' is not iterable
I am using this GitHub project as a blueprint to structure the Google Distance Matrix portion. I am using Python 2.7.
Code:
from google import search
import pandas as pd
from pandas import DataFrame
import googlemaps
from googlemaps import convert
from googlemaps.convert import as_list
import datetime
#stores my API code as 'gmaps'
key = '(my API Key)'
client = googlemaps.Client(key)
#establishes: drive time (in minutes), english, non-metric measurements, trip occurs at 1:00pm PST
def distance_matrix(client, origins, destinations,
                    mode="driving", language="en", avoid=None, units="imperial",
                    departure_time=None, arrival_time=None, transit_mode=None,
                    transit_routing_preference=None):
    # establishes "origins" and "destinations" header format to direct pandas to begin
    params = {
        "origins": 'Origin_Zip',
        "destinations": 'Destination_BH'
    }
    # Reads the strings within the rows of "drive_ca.csv" via the indicated column (usecols=)
    # to automate querying the Google Distance Matrix API
    df = pd.read_csv('C:\Users\Desktop\drive_ca.csv', usecols=['Origin_Zip'])
    # Number indicates outputs to result
    stop = 1
    # Assigns a column name to iterate
    urlcols = ['Destination_BH']
    # First, apply() to call the google distance Matrix for each 'row'
    # A list is built for the urls return by search()
    df[urlcols] = df['Origin_Zip'].apply(lambda Origin_Zip : pd.Series([destinations for destinations in search(Origin_Zip, stop=stop, pause=5.0)][:stop]))
    departure_time = datetime.datetime.fromtimestamp(1428580693)
    if mode:
        # NOTE(broady): the mode parameter is not validated by the Maps API
        # server. Check here to prevent silent failures.
        if mode not in ["driving", "walking", "bicycling", "transit"]:
            raise ValueError("Invalid travel mode.")
        params["mode"] = mode
    if language:
        params["language"] = language
    if avoid:
        if avoid not in ["tolls", "highways", "ferries"]:
            raise ValueError("Invalid route restriction.")
        params["avoid"] = avoid
    if units:
        params["units"] = units
    if departure_time:
        params["departure_time"] = convert.time(departure_time)
    if arrival_time:
        params["arrival_time"] = convert.time(arrival_time)
    if departure_time and arrival_time:
        raise ValueError("Should not specify both departure_time and"
                         "arrival_time.")
    if transit_mode:
        params["transit_mode"] = convert.join_list("|", transit_mode)
    if transit_routing_preference:
        params["transit_routing_preference"] = transit_routing_preference
    print params
    return client._get("/maps/api/distancematrix/json", params)

# prints corresponding duration to the indicated header row in "drive_ca.csv"
df.to_csv('C:\Users\Desktop\drive_ca.csv', usecols=['Destination_BH'])
Complete Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-1a75d4fe26fb> in <module>()
34 # First, apply() to call the google distance Matrix for each 'row'
35 # A list is built for the urls return by search()
---> 36 df[urlcols] = df['Origin_Zip'].apply(lambda Origin_Zip : pd.Series([destinations for destinations in search(Origin_Zip, stop=stop, pause=5.0)][:stop]))
37
38
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
2056 values = lib.map_infer(values, lib.Timestamp)
2057
-> 2058 mapped = lib.map_infer(values, f, convert=convert_dtype)
2059 if len(mapped) and isinstance(mapped[0], Series):
2060 from pandas.core.frame import DataFrame
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:56997)()
<ipython-input-4-1a75d4fe26fb> in <lambda>(Origin_Zip)
34 # First, apply() to call the google distance Matrix for each 'row'
35 # A list is built for the urls return by search()
---> 36 df[urlcols] = df['Origin_Zip'].apply(lambda Origin_Zip : pd.Series([destinations for destinations in search(Origin_Zip, stop=stop, pause=5.0)][:stop]))
37
38
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\google.pyc in search(query, tld, lang, num, start, stop, pause, only_standard)
174
175 # Prepare the search string.
--> 176 query = quote_plus(query)
177
178 # Grab the cookie from the home page.
C:\Users\AppData\Local\Continuum\Anaconda2\lib\urllib.pyc in quote_plus(s, safe)
1290 def quote_plus(s, safe=''):
1291 """Quote the query fragment of a URL; replacing ' ' with '+'"""
-> 1292 if ' ' in s:
1293 s = quote(s, safe + ' ')
1294 return s.replace(' ', '+')
TypeError: argument of type 'numpy.int64' is not iterable
.CSV set-up: (screenshot of the input spreadsheet, referenced in the answer below, not reproduced here)
It's possible to diagnose the problem just by looking at the traceback. Working backwards from where the exception was raised:
C:\Users\AppData\Local\Continuum\Anaconda2\lib\urllib.pyc in quote_plus(s, safe)
1290 def quote_plus(s, safe=''):
1291 """Quote the query fragment of a URL; replacing ' ' with '+'"""
-> 1292 if ' ' in s:
1293 s = quote(s, safe + ' ')
1294 return s.replace(' ', '+')
TypeError: argument of type 'numpy.int64' is not iterable
This tells me that s is a numpy.int64 rather than a string. s is the query input to quote_plus(query):
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\google.pyc in search(query, tld, lang, num, start, stop, pause, only_standard)
174
175 # Prepare the search string.
--> 176 query = quote_plus(query)
177
178 # Grab the cookie from the home page.
From looking at the part after "in", which shows where these lines were executed, I can tell that query is the first argument to the google.search() function:
search(query, tld, lang, num, start, stop, pause, only_standard)
Without even looking at the documentation, I can therefore infer from the traceback that search expects its first argument to be a string, but it is currently getting a numpy.int64.
The input to google.search() is generated by this nasty-looking lambda function:
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:56997)()
<ipython-input-4-1a75d4fe26fb> in <lambda>(Origin_Zip)
34 # First, apply() to call the google distance Matrix for each 'row'
35 # A list is built for the urls return by search()
---> 36 df[urlcols] = df['Origin_Zip'].apply(lambda Origin_Zip : pd.Series([destinations for destinations in search(Origin_Zip, stop=stop, pause=5.0)][:stop]))
37
38
The relevant part is search(Origin_Zip, stop=stop, pause=5.0). Each Origin_Zip here will be a value taken from the 'Origin_Zip' column of df, which pinpoints the source of the problem: df['Origin_Zip'] should contain strings, but at the moment it contains numpy.int64s.
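You can reproduce the failure in isolation (an illustrative check, not part of the original answer):
import numpy as np
from urllib import quote_plus   # Python 2 standard library, as seen in the traceback

quote_plus('90278')             # fine: returns '90278'
quote_plus(np.int64(90278))     # TypeError: argument of type 'numpy.int64' is not iterable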
Based on your screenshot, I'm guessing that since the string values in the CSV file look like '90278', pandas is automatically converting them to integer values. If you convert that column to strings then the problem will probably go away, for example:
df['Origin_Zip'] = df['Origin_Zip'].astype(str)
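Alternatively (a suggestion beyond the original answer, using a standard pandas option), you can stop the integer inference at read time by forcing the column's dtype:
# read the ZIP column as strings from the start
df = pd.read_csv('C:\Users\Desktop\drive_ca.csv',
                 usecols=['Origin_Zip'],
                 dtype={'Origin_Zip': str})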

How optimize word counting in Python?

I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100% of my CPU (Core i5, 8 GB RAM, MacBook Air 2014) and ran for 14 hours before I shut the process down. How can I speed up the looping and counting?
I have created a corpus in NLTK out of three Swedish UTF-8 formatted, tab-separated files Swe_Newspapers.txt, Swe_Blogs.txt, Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text-file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text file I want to analyse (Swe_Blogs.txt) has this structure, and parses fine:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to build the counter as below does not work as-is, but can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus.words(categories=["Blogs"]) to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word in the wordlist (about 20K words) within the blogs in the corpus (115.7 MB) that my computer gets a little tired. How can I speed up the following code? It seems to work, with no error messages, but it takes more than 14 hours to execute.
import collections

counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1
        else:
            continue
Any help to improve my coding skills is much appreciated!
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1
This would be much faster as:
words = set(my_wordlist.words())  # call once, make a set for fast membership checks
for word in mycorp.words(categories="Blogs"):
    if word in words:
        counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
                  if word in words)
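Once built, the Counter can report the most frequent tokens directly; a quick sketch (Python 2 print statements, matching the question's set-up):
for token, count in counter.most_common(20):
    print token, count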
You already got good answers on how to count words properly with Python. The problem is that it will still be quite slow. If you are just exploring the corpora, a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the 100 most frequent tokens in descending order (cut splits on tab-separated fields by default):
cat Swe_Blogs.txt | cut --fields=5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100

Send output from iPython console to .csv file. (& viewing data issue)

Using the iPython console, I built a pandas dataframe called df.
for (k1, k2), group in df.groupby(['II', 'time']):
    print k1, k2
    print group
df['II'] stores integers between: [-10,10].
'time' can be either 930 or 1620
My goal is to save the output of this loop to a single .csv file. (Not ideal, but I copied and pasted the output into a csv.) In doing so, I noticed that the "II" == -1 groups, at both times 930 and 1620, do not appear in full data view like the others (they both exist, though).
For example, the group for "II" == -1 at time 930 appears in the console as:
-1 930
<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 2 to 2140
Data columns:
index 268 non-null values
date 268 non-null values
time 268 non-null values
price 268 non-null values
round5 268 non-null values
II 268 non-null values
Pattern 268 non-null values
pl 268 non-null values
dtypes: float64(2), int64(4), object(2)
Knowing that the group exists, I tried brute force, pulling the rows out manually:
u = df['II'] == -1
one = df.groupby('time')[u]
# To check the result:
one.to_csv('file.csv')
I'm grouping by 'time', so all times should appear. Yet the resulting csv only contains the 1620 times; all the 930 results are, unfortunately, missing in action. It's bizarre. Your suggestions are greatly appreciated.
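A minimal sketch of one way to get every (II, time) group into a single .csv (an illustration only; the output file name is a placeholder, and this is not an answer from the original thread):
import pandas as pd

# collect the groups in groupby order and write them back out as one file;
# the II and time key columns are already present in each group
pieces = [group for (k1, k2), group in df.groupby(['II', 'time'])]
pd.concat(pieces).to_csv('all_groups.csv', index=False)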