Cachegrind file very small - profiling

I am new to profiling. I am trying to profile my PHP code with Xdebug.
I have set
xdebug.profiler_enable_trigger = 1
xdebug.profiler_output_name = cachegrind+%p+%H+%R.cg
and I call my page with the additional GET parameter ?XDEBUG_PROFILE=1.
The cachegrind file is generated, but it has no significant content.
Here is my output:
version: 1
creator: xdebug 2.7.0alpha1 (PHP 7.0.30-dev)
cmd: C:\WPNserver\www\DMResources\Classes\VendorClasses\PHPMySQLiDatabase\MysqliDb.php
part: 1
positions: line
events: Time Memory
fl=(1)
fn=(221) php::mysqli->close
1244 103 -14832
fl=(42)
fn=(222) MysqliDbExt->__destruct
1239 56 0
cfl=(1)
cfn=(221)
calls=1 0 0
1244 103 -14832
That's it - I must be missing something fundamental.

I think you hit this bug in Xdebug.
As suggested by Derick in the issue tracker, you can work around it by adding %r to the profiler output name, e.g.:
xdebug.profiler_output_name = cachegrind+%p+%H+%R+%r.cg
(%r adds a random number to the name)
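For reference, the two relevant settings then look like this together (whether they live in php.ini or a separate Xdebug .ini file depends on your setup):
xdebug.profiler_enable_trigger = 1
xdebug.profiler_output_name = cachegrind+%p+%H+%R+%r.cg
With the trigger enabled, the profile is only written when a trigger such as your ?XDEBUG_PROFILE=1 GET parameter is present, so keep calling the page that way.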

Related

Parse output from k6 data to get specific information

I am trying to extract data from a k6 output (https://docs.k6.io/docs/results-output):
data_received.........: 246 kB 21 kB/s
data_sent.............: 174 kB 15 kB/s
http_req_blocked......: avg=26.24ms min=0s med=13.5ms max=145.27ms p(90)=61.04ms p(95)=70.04ms
http_req_connecting...: avg=23.96ms min=0s med=12ms max=145.27ms p(90)=57.03ms p(95)=66.04ms
http_req_duration.....: avg=197.41ms min=70.32ms med=91.56ms max=619.44ms p(90)=288.2ms p(95)=326.23ms
http_req_receiving....: avg=141.82µs min=0s med=0s max=1ms p(90)=1ms p(95)=1ms
http_req_sending......: avg=8.15ms min=0s med=0s max=334.23ms p(90)=1ms p(95)=1ms
http_req_waiting......: avg=189.12ms min=70.04ms med=91.06ms max=343.42ms p(90)=282.2ms p(95)=309.22ms
http_reqs.............: 190 16.054553/s
iterations............: 5 0.422488/s
vus...................: 200 min=200 max=200
vus_max...............: 200 min=200 max=200
The data comes in the format above, and I am trying to find a way to keep each line but with only the values. For example:
http_req_duration: 197.41ms, 70.32ms,91.56ms, 619.44ms, 288.2ms, 326.23ms
I have to do this for ~50-100 files and want to find a regex or a similarly quick way to do it without writing too much code. Is that possible?
Here's a simple Python solution:
import re

FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL)  # split the line into name and value
VALUES = re.compile(r"(?<==).*?(?=\s|$)")  # match individual values in http_req_* fields

# open the input file `k6_input.log` for reading and `k6_parsed.log` for writing
with open("k6_input.log", "r") as f_in, open("k6_parsed.log", "w") as f_out:
    for line in f_in:  # read the input file line by line
        field = FIELD.match(line)  # first match all <field_name>...:<values> fields
        if field:
            name = field.group(1)  # get the field name from the first capture group
            f_out.write(name + ": ")  # write the field name to the output file
            value = field.group(2)  # get the field value from the second capture group
            if name[:9] == "http_req_":  # parse out only the http_req_* fields
                f_out.write(", ".join(VALUES.findall(value)) + "\n")  # extract the values
            else:  # copy other fields verbatim
                f_out.write(value)
        else:  # encountered an unrecognizable line, just copy it
            f_out.write(line)
For a file with the contents above, you'll get the following result:
data_received: 246 kB 21 kB/s
data_sent: 174 kB 15 kB/s
http_req_blocked: 26.24ms, 0s, 13.5ms, 145.27ms, 61.04ms, 70.04ms
http_req_connecting: 23.96ms, 0s, 12ms, 145.27ms, 57.03ms, 66.04ms
http_req_duration: 197.41ms, 70.32ms, 91.56ms, 619.44ms, 288.2ms, 326.23ms
http_req_receiving: 141.82µs, 0s, 0s, 1ms, 1ms, 1ms
http_req_sending: 8.15ms, 0s, 0s, 334.23ms, 1ms, 1ms
http_req_waiting: 189.12ms, 70.04ms, 91.06ms, 343.42ms, 282.2ms, 309.22ms
http_reqs: 190 16.054553/s
iterations: 5 0.422488/s
vus: 200 min=200 max=200
vus_max: 200 min=200 max=200
If you have to run it over many files, I'd suggest you investigate glob.glob(), os.walk(), or os.listdir() to list all the files you need, then loop over them and run the above, further automating the process.
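For instance, a minimal sketch of the batch version (the k6_*.log naming pattern and the _parsed output suffix are assumptions; adjust them to your actual file names):

import glob
import re

FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL)
VALUES = re.compile(r"(?<==).*?(?=\s|$)")

def parse_file(in_name, out_name):
    # the same per-line logic as above, factored into a function
    with open(in_name, "r") as f_in, open(out_name, "w") as f_out:
        for line in f_in:
            field = FIELD.match(line)
            if not field:
                f_out.write(line)  # unrecognizable line: copy verbatim
                continue
            name, value = field.group(1), field.group(2)
            if name.startswith("http_req_"):
                f_out.write(name + ": " + ", ".join(VALUES.findall(value)) + "\n")
            else:
                f_out.write(name + ":" + value)

for in_name in glob.glob("k6_*.log"):
    parse_file(in_name, in_name.replace(".log", "_parsed.log"))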

Problems importing strings to form a path from a .csv file

I am referring to this question I posted a few days ago. I haven't got any replies yet, and I suspect the situation was not properly described, so I have made a simpler set-up that should be easier to understand and, hopefully, get more attention from experienced programmers!
I forgot to mention that I am running Python 2 on Jupyter.
import pandas as pd
from pandas import Series, DataFrame
g_input_df = pd.read_csv('SetsLoc.csv')
URL=g_input_df.iloc[0,0]
c_input_df = pd.read_csv(URL)
c_input_df = c_input_df.set_index("Parameter")
root_path = c_input_df.loc["root_1"]
input_rel_path = c_input_df.loc["root_2"]
input_file_name = c_input_df.loc["file_name"]
This section reads a list of paths from a .csv file, one at a time; each path points to another .csv file that contains the input for a simulation to be set up in Python.
The results from the above code can be tested here:
c_input_df
           Value
Parameter
root_1     C:/SimpleTest/
root_2     Input/
file_name  Prop_1.csv
URL
'C:/SimpleTest/Sets/Set_1.csv'
root_path+input_rel_path+input_file_name
Value C:/SimpleTest/Input/Prop_1.csv
dtype: object
Property_1 = pd.read_csv('C:/SimpleTest/Input/Prop_1.csv')
Property_1
height weight
0 100 50
1 110 44
2 98 42
...on the other hand, if I try to use variables to describe the file's path and name:
Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
Property_1
I get the following error:
ValueErrorTraceback (most recent call last)
<ipython-input-3-1d5306b6bdb5> in <module>()
----> 1 Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
2 Property_1
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(
--> 392 filepath_or_buffer, encoding, compression)
393 kwds['compression'] = compression
394
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
208 if not is_file_like(filepath_or_buffer):
209 msg = "Invalid file path or buffer object type: {_type}"
--> 210 raise ValueError(msg.format(_type=type(filepath_or_buffer)))
211
212 return filepath_or_buffer, None, compression
ValueError: Invalid file path or buffer object type: <class 'pandas.core.series.Series'>
I believe the problem resides in the way the parameters that make up the path and filename are read from the dataframe. Is there any way to specify that those parameters are paths, or something similar that will avoid this problem?
Any help is highly appreciated!
I posted the solution in the other question related to this post, in case someone wants to take a look:
Problems opening a path in Python
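The linked answer is not reproduced here, but the traceback above already points at the cause: c_input_df.loc["root_1"] returns a one-row Series rather than a string, so the concatenation yields a Series, which read_csv rejects. A minimal sketch of a fix along those lines (assuming the Value column shown in the question):

# select scalar strings from the Value column instead of whole rows
root_path = c_input_df.loc["root_1", "Value"]
input_rel_path = c_input_df.loc["root_2", "Value"]
input_file_name = c_input_df.loc["file_name", "Value"]

Property_1 = pd.read_csv(root_path + input_rel_path + input_file_name)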

How do I set the column width of a pexpect ssh session?

I am writing a simple Python script to connect to a SAN via SSH and run a set of commands. Ultimately each command will be logged to a separate log along with a timestamp, and then the script will exit. This is because the device we are connecting to doesn't support certificate SSH connections and doesn't have decent logging capabilities on its current firmware revision.
The issue I seem to be running into is that the SSH session that is created is limited to 78 characters wide. The results generated by each command are significantly wider: 155 characters. This is causing a bunch of funkiness.
First, the results in their current state are significantly more difficult to parse. Second, because the buffer is significantly smaller, the final volume command won't execute properly, because the pexpect-launched SSH session actually gets prompted to "press any key to continue".
How do I change the column width of the pexpect session?
Here is the current code (it works but is incomplete):
#!/usr/bin/python
import pexpect
import time

PASS = 'mypassword'
HOST = '1.2.3.4'
LOGIN_COMMAND = 'ssh manage@' + HOST
CTL_COMMAND = 'show controller-statistics'
VDISK_COMMAND = 'show vdisk-statistics'
VOL_COMMAND = 'show volume-statistics'
VDISK_LOG = 'vdisk.log'
VOLUME_LOG = 'volume.log'
CONTROLLER_LOG = 'controller.log'
DATE = time.strftime('%Y%m%d%H%M%S')  # os.system('date +...') returns the exit status, not the date string
child = pexpect.spawn(LOGIN_COMMAND)
child.setecho(True)
child.logfile = open('FetchSan.log', 'w+')
child.expect('Password: ')
child.sendline(PASS)
child.expect('# ')
child.sendline(CTL_COMMAND)
print child.before
child.expect('# ')
child.sendline(VDISK_COMMAND)
print child.before
child.expect('# ')
print "Sending "+VOL_COMMAND
child.sendline(VOL_COMMAND)
print child.before
child.expect('# ')
child.sendline('exit')
child.expect(pexpect.EOF)
print child.before
The output expected:
# show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
---------------------------------------------------------------------------------------------------------------------------------------------------------
controller_A 0 45963169 1573.3KB 67 386769785 514179976 6687.8GB 5750.6GB
controller_B 20 45963088 4627.4KB 421 3208370173 587661282 63.9TB 5211.2GB
---------------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
# show vdisk-statistics
Name Serial Number Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
------------------------------------------------------------------------------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0 45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2282.4KB 164 23229435 76509765 5506.7GB 1605.3GB
DATA1 00c0ff1311f3000087d8c44f00000000 2286.5KB 167 23490851 78314374 5519.0GB 1603.8GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0 26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 654.8KB 5 3049980 15317236 1187.3GB 1942.1GB
FRA1 00c0ff13349e000007d9c44f00000000 778.7KB 6 3016569 15234734 1179.3GB 1940.4GB
------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
# show volume-statistics
Name Serial Number Bytes per second IOPS Number of Reads Number of Writes Data Read Data Written
-----------------------------------------------------------------------------------------------------------------------------------------------------
CRS_v001 00c0ff13349e0000fdd6c44f01000000 14.8KB 5 239611146 107147564 1321.1GB 110.5GB
DATA1_v001 00c0ff1311f30000d0d8c44f01000000 2402.8KB 218 1701488316 336678620 33.9TB 3184.6GB
DATA2_v001 00c0ff1311f3000040f9ce5701000000 0B 0 921 15 2273.7KB 2114.0KB
DATA_v001 00c0ff1311f30000bdd7c44f01000000 2303.4KB 209 1506883611 250984824 30.0TB 2026.6GB
FRA1_v001 00c0ff13349e00001ed9c44f01000000 709.1KB 28 25123082 161710495 1891.0GB 2230.0GB
FRA_v001 00c0ff13349e00001fd8c44f01000000 793.0KB 34 122052720 245322281 3475.7GB 3410.0GB
-----------------------------------------------------------------------------------------------------------------------------------------------------
Success: Command completed successfully.
The output as printed to the terminal (as mentioned, the 3rd command won't execute in its current state):
show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
controller_A 3 45962495 3803.1KB
73 386765821 514137947 6687.8GB
5748.9GB
controller_B 20 45962413 5000.7KB
415 3208317860 587434274 63.9TB
5208.8GB
----------------------------------------------------------------------
Success: Command completed successfully.
Sending show volume-statistics
show vdisk-statistics
Name Serial Number Bytes per second IOPS
Number of Reads Number of Writes Data Read Data Written
----------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0
45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2187.2KB 152
23220764 76411017 5506.3GB 1604.1GB
DATA1 00c0ff1311f3000087d8c44f00000000 2295.2KB 154
23481442 78215540 5518.5GB 1602.6GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0
26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 1829.3KB 14
3049951 15310681 1187.3GB 1941.2GB
FRA1 00c0ff13349e000007d9c44f00000000 1872.8KB 14
3016521 15228157 1179.3GB 1939.5GB
----------------------------------------------------------------------------
Success: Command completed successfully.
Traceback (most recent call last):
File "./fetchSAN.py", line 34, in <module>
child.expect('# ')
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/spawnbase.py", line 321, in expect
timeout, searchwindowsize, async)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/spawnbase.py", line 345, in expect_list
return exp.expect_loop(timeout)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/expect.py", line 107, in expect_loop
return self.timeout(e)
File "/Library/Python/2.7/site-packages/pexpect-4.2.1-py2.7.egg/pexpect/expect.py", line 70, in timeout
raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x105333910>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', 'manage@10.254.27.49']
buffer (last 100 chars): '-------------------------------------------------------------\r\nPress any key to continue (Q to quit)'
before (last 100 chars): '-------------------------------------------------------------\r\nPress any key to continue (Q to quit)'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 19519
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: <open file 'FetchSan.log', mode 'w+' at 0x1053321e0>
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("# ")
And here is what is captured in the log:
Password: mypassword
HP StorageWorks MSA Storage P2000 G3 FC
System Name: Uninitialized Name
System Location:Uninitialized Location
Version:TS230P008
# show controller-statistics
show controller-statistics
Durable ID CPU Load Power On Time (Secs) Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
controller_A 3 45962495 3803.1KB
73 386765821 514137947 6687.8GB
5748.9GB
controller_B 20 45962413 5000.7KB
415 3208317860 587434274 63.9TB
5208.8GB
----------------------------------------------------------------------
Success: Command completed successfully.
# show vdisk-statistics
show vdisk-statistics
Name Serial Number Bytes per second IOPS
Number of Reads Number of Writes Data Read Data Written
----------------------------------------------------------------------------
CRS 00c0ff13349e000006d5c44f00000000 0B 0
45861 26756 3233.0MB 106.2MB
DATA 00c0ff1311f300006dd7c44f00000000 2187.2KB 152
23220764 76411017 5506.3GB 1604.1GB
DATA1 00c0ff1311f3000087d8c44f00000000 2295.2KB 154
23481442 78215540 5518.5GB 1602.6GB
DATA2 00c0ff1311f30000c2f8ce5700000000 0B 0
26 4 1446.9KB 65.5KB
FRA 00c0ff13349e000001d8c44f00000000 1829.3KB 14
3049951 15310681 1187.3GB 1941.2GB
FRA1 00c0ff13349e000007d9c44f00000000 1872.8KB 14
3016521 15228157 1179.3GB 1939.5GB
----------------------------------------------------------------------------
Success: Command completed successfully.
# show volume-statistics
show volume-statistics
Name Serial Number Bytes per second
IOPS Number of Reads Number of Writes Data Read
Data Written
----------------------------------------------------------------------
CRS_v001 00c0ff13349e0000fdd6c44f01000000 11.7KB
5 239609039 107145979 1321.0GB
110.5GB
DATA1_v001 00c0ff1311f30000d0d8c44f01000000 2604.5KB
209 1701459941 336563041 33.9TB
3183.3GB
DATA2_v001 00c0ff1311f3000040f9ce5701000000 0B
0 921 15 2273.7KB
2114.0KB
DATA_v001 00c0ff1311f30000bdd7c44f01000000 2382.8KB
194 1506859273 250871273 30.0TB
2025.4GB
FRA1_v001 00c0ff13349e00001ed9c44f01000000 1923.5KB
31 25123006 161690520 1891.0GB
2229.1GB
FRA_v001 00c0ff13349e00001fd8c44f01000000 2008.5KB
37 122050872 245301514 3475.7GB
3409.1GB
----------------------------------------------------------------------
Press any key to continue (Q to quit)%
As a starting point: According to the manual, that SAN has a command to disable the pager. See the documentation for set cli-parameters pager off. It may be sufficient to execute that command. It may also have a command to set the terminal rows and columns that it uses for formatting output, although I wasn't able to find one.
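For example, you could send that command right after logging in; a sketch that reuses the prompt pattern from the rest of your script:

child.expect('# ')                              # wait for the first prompt
child.sendline('set cli-parameters pager off')  # disable "Press any key to continue" paging
child.expect('# ')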
Getting to your question: When an ssh client connects to a server and requests an interactive session, it can optionally request a PTY (pseudo-tty) for the server side of the session. When it does that, it informs the server of the lines, columns, and terminal type which the server should use for the TTY. Your SAN may honor PTY requests and use the lines and columns values to format its output. Or it may not.
The ssh client gets the rows and columns for the PTY request from the TTY for its standard input. This is the PTY which pexpect is using to communicate with ssh.
This question discusses how to set the terminal size for a pexpect session. ssh doesn't honor the LINES or COLUMNS environment variables as far as I can tell, so I doubt that would work. However, calling child.setwinsize() after spawning ssh ought to work:
child = pexpect.spawn(cmd)
child.setwinsize(400, 400)  # rows, cols
If you have trouble with this, you could try setting the terminal size by invoking stty locally before ssh:
child = pexpect.spawn('stty rows x cols y; ssh user@host')
Finally, you need to make sure that ssh actually requests a PTY for the session. It does this by default in some cases, which should include the way you are running it. But it has a command-line option -tt to force it to allocate a PTY. You could add that option to the ssh command line to make sure:
child = pexpect.spawn('ssh -tt user@host')
or
child = pexpect.spawn('stty rows x cols y; ssh -tt user@host')
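Putting the pieces together with your script, the spawn section might look like this (a sketch; 400x400 is an arbitrary size comfortably wider than your 155-column output):

child = pexpect.spawn('ssh -tt manage@' + HOST)
child.setwinsize(400, 400)  # rows, cols for the PTY that ssh reports to the server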

Forvalues dropping leading 0's, how to fix?

I am attempting to create a loop to save me having to type out the code many times. Essentially, I have 60 csv files that I need to alter and save. My code looks as follows:
forvalues i = 0203 0206 : 1112 {
    cd "C:\Users\User\Desktop\Data\"
    import delimited `i'.csv, varnames(1)
    gen time = `i'
    keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
    save `i'.dta, replace
}
However, I am getting the error "203.csv" does not exist. It seems to be dropping the leading 0; is there any way to fix this?
You are asking for a numlist, but in this context 0203, with nothing else said, just looks to Stata like a quirky but acceptable way to write 203: hence your problem.
But do you really have a numlist that is 0203 0206 : 1112?
Try it:
numlist "0203 0206 : 1112"
ret li
The list starts 203 206 209 212 215 218 221 224 227 230 233 236 ...
My wild guess is that you have files, one for each quarter over a period, labelled 0203 for March 2002 through to 1112 for December 2011. In fact you do say that you have times like these, even though my guess implies 40 files, not 60. If so, that means you won't have a file labelled 0215, so this is the wrong way to think in any case.
Here is a better approach. First take the cd out of the loop: you need only do that once!
cd "C:\Users\User\Desktop\Data"
Now find the files that are ????.csv. You need only install fs once.
ssc inst fs
fs ????.csv
foreach f in `r(files)' {
    import delimited `f', varnames(1)
    local time = substr("`f'", 1, 4)    // first four characters of the file name, e.g. 0203
    gen time = "`time'"
    keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
    save `time'.dta, replace
}
On my guess, you still need to fix the time to something civilised and you would be better off appending the files, but one problem at a time.
Note that the issue of leading zeros, which you think is the problem here but is probably a red herring, is written up here.

How do I optimize word counting in Python?

I'm taking my first steps writing code for linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes close to 100% of my CPU (Core i5, 8 GB RAM, MacBook Air 2014) and had run for 14 hours before I shut the process down. How can I speed up the looping and counting?
I have created a corpus in NLTK out of three Swedish UTF-8 encoded, tab-separated files: Swe_Newspapers.txt, Swe_Blogs.txt, and Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text file I want to analyse (Swe_Blogs.txt) has this structure, and parses fine:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to build the counter as below does not work out of the box, but it can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus.words(categories=["Blogs"]) to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115.7 MB) that my computer gets a little tired. How can I speed up the following code? It seems to work, with no error messages, but it takes >14h to execute.
import collections

counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1
        else:
            continue
Any help to improve my coding skills is much appreciated!
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1
This would be much faster as:
words = set(my_wordlist.words())  # call once, make a set for fast membership checks
for word in mycorp.words(categories="Blogs"):
    if word in words:
        counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
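If you want to convince yourself of the difference, a quick micro-benchmark sketch on synthetic data (the sizes are made up to roughly match your 20K wordlist, not taken from your corpora):

import timeit

wordlist = ['word%d' % i for i in range(20000)]
corpus = ['word%d' % (i % 40000) for i in range(10000)]
words = set(wordlist)

print timeit.timeit(lambda: [w for w in corpus if w in wordlist], number=1)  # list scan: seconds
print timeit.timeit(lambda: [w for w in corpus if w in words], number=1)     # set lookup: milliseconds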
You can also build the Counter directly from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
                  if word in words)
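Once the counter is built, Counter can also hand you the most frequent matches directly, e.g. the top 100:

for word, count in counter.most_common(100):
    print word, count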
You already got good answers on how to count words properly with Python. The problem is that it will still be quite slow. If you are just exploring the corpora, a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the 100 most frequent tokens in descending order (cut keeps the text column, tr splits it into one token per line, sort | uniq -c counts the duplicates, and sort -nr | head keeps the top 100):
cat Swe_Blogs.txt | cut --delimiter=$'\t' --fields=5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100