Parse output from k6 data to get specific information - regex

I am trying to extract data from a k6 output (https://docs.k6.io/docs/results-output):
data_received.........: 246 kB 21 kB/s
data_sent.............: 174 kB 15 kB/s
http_req_blocked......: avg=26.24ms min=0s med=13.5ms max=145.27ms p(90)=61.04ms p(95)=70.04ms
http_req_connecting...: avg=23.96ms min=0s med=12ms max=145.27ms p(90)=57.03ms p(95)=66.04ms
http_req_duration.....: avg=197.41ms min=70.32ms med=91.56ms max=619.44ms p(90)=288.2ms p(95)=326.23ms
http_req_receiving....: avg=141.82µs min=0s med=0s max=1ms p(90)=1ms p(95)=1ms
http_req_sending......: avg=8.15ms min=0s med=0s max=334.23ms p(90)=1ms p(95)=1ms
http_req_waiting......: avg=189.12ms min=70.04ms med=91.06ms max=343.42ms p(90)=282.2ms p(95)=309.22ms
http_reqs.............: 190 16.054553/s
iterations............: 5 0.422488/s
vus...................: 200 min=200 max=200
vus_max...............: 200 min=200 max=200
The data comes in the above format and I am trying to find a way to get each line in the above along with the values only. As an example:
http_req_duration: 197.41ms, 70.32ms,91.56ms, 619.44ms, 288.2ms, 326.23ms
I have to do this for ~50-100 files and want to find a RegEx or similar quicker way to do it, without writing too much code. Is it possible?

Here's a simple Python solution:
import re
FIELD = re.compile(r"(\w+)\.*:(.*)", re.DOTALL) # split the line to name:value
VALUES = re.compile(r"(?<==).*?(?=\s|$)") # match individual values from http_req_* fields
# open the input file `k6_input.log` for reading, and k6_parsed.log` for parsing
with open("k6_input.log", "r") as f_in, open("k6_parsed.log", "w") as f_out:
for line in f_in: # read the input file line by line
field = FIELD.match(line) # first match all <field_name>...:<values> fields
if field:
name = field.group(1) # get the field name from the first capture group
f_out.write(name + ": ") # write the field name to the output file
value = field.group(2) # get the field value from the second capture group
if name[:9] == "http_req_": # parse out only http_req_* fields
f_out.write(", ".join(VALUES.findall(value)) + "\n") # extract the values
else: # verbatim copy of other fields
f_out.write(value)
else: # encountered unrecognizable field, just copy the line
f_out.write(line)
For a file with contents as above you'll get a resulting:
data_received: 246 kB 21 kB/s
data_sent: 174 kB 15 kB/s
http_req_blocked: 26.24ms, 0s, 13.5ms, 145.27ms, 61.04ms, 70.04ms
http_req_connecting: 23.96ms, 0s, 12ms, 145.27ms, 57.03ms, 66.04ms
http_req_duration: 197.41ms, 70.32ms, 91.56ms, 619.44ms, 288.2ms, 326.23ms
http_req_receiving: 141.82µs, 0s, 0s, 1ms, 1ms, 1ms
http_req_sending: 8.15ms, 0s, 0s, 334.23ms, 1ms, 1ms
http_req_waiting: 189.12ms, 70.04ms, 91.06ms, 343.42ms, 282.2ms, 309.22ms
http_reqs: 190 16.054553/s
iterations: 5 0.422488/s
vus: 200 min=200 max=200
vus_max: 200 min=200 max=200
If you have to run it over many files, I'd suggest you to investigate os.glob(), os.walk() or os.listdir() to list all the files you need and then loop over them and execute the above, thus further automating the process.

Related

Problems importing strings to form a path from a .csv file

I am referring to this question I posted days ago, I haven't' get any replies yet and I suspect that the situation was not properly described, I made a simpler set up that would be easier to understand, and hopefully. get more attention from the experienced programmers!
I forgot to mention, I am running Python 2 on Jupyter
import pandas as pd
from pandas import Series, DataFrame
g_input_df = pd.read_csv('SetsLoc.csv')
URL=g_input_df.iloc[0,0]
c_input_df = pd.read_csv(URL)
c_input_df = c_input_df.set_index("Parameter")
root_path = c_input_df.loc["root_1"]
input_rel_path = c_input_df.loc["root_2"]
input_file_name = c_input_df.loc["file_name"]
This section reads from a .csv a list of paths, just one at a time, each one of them directing to another .csv file that contains the input for a simulation to be set-up using python.
The results from the above code can be tested here:
c_input_df
Value Parameter
root_1 C:/SimpleTest/
root_2 Input/
file_name Prop_1.csv
URL
'C:/SimpleTest/Sets/Set_1.csv'
root_path+input_rel_path+input_file_name
Value C:/SimpleTest/Input/Prop_1.csv
dtype: object
Property_1 = pd.read_csv('C:/SimpleTest/Input/Prop_1.csv')
Property_1
height weight
0 100 50
1 110 44
2 98 42
...on the other hand, if I try to use varibales to describe the file's path and name I get an error:
Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
Property_1
I get the following error:
ValueErrorTraceback (most recent call last)
<ipython-input-3-1d5306b6bdb5> in <module>()
----> 1 Property_1 = pd.read_csv(root_path+input_rel_path+input_file_name)
2 Property_1
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(
--> 392 filepath_or_buffer, encoding, compression)
393 kwds['compression'] = compression
394
C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
208 if not is_file_like(filepath_or_buffer):
209 msg = "Invalid file path or buffer object type: {_type}"
--> 210 raise ValueError(msg.format(_type=type(filepath_or_buffer)))
211
212 return filepath_or_buffer, None, compression
ValueError: Invalid file path or buffer object type: <class 'pandas.core.series.Series'>}
I beleive that the problem resides in the way the parameters that make up the path and filenemae are read from the dataframe, is there any way to specify that those parameters are paths, or something similar that will avoid this problem?
Any help is highly appreciated!
I posted the solution in the other question related to this post, in case someone wants to get a look:
Problems opening a path in Python

Python, calculating time difference

I'm parsing logs generated from multiple sources and joined together to form a huge log file in the following format;
My_testNumber: 14, JobType = testx.
ABC 2234
**SR 111**
1483529571 1 1 Wed Jan 4 11:32:51 2017 0 4
datatype someRandomValue
SourceCode.Cpp 588
DBConnection failed
TB 132
**SR 284**
1483529572 0 1 Wed Jan 4 11:32:52 2017 5010400 4
datatype someRandomXX
SourceCode2.cpp 455
DBConnection Success
TB 102
**SR 299**
1483529572 0 1 **Wed Jan 4 11:32:54 2017** 5010400 4
datatype someRandomXX
SourceCode3.cpp 455
ConnectionManager Success
....
(there are dozens of SR Numbers here)
Now i'm looking a smart way to parse logs so that it calculates time differences in seconds for each testNumber and SR number
like
My_testNumber:14 it subtracts SR 284 and SR 111 time (difference would be 1 second here), for SR 284 and 299 it is 2 seconds and so on.
You can parse your posted log file and save the corresponding data accordingly. Then, you can work with the data to get the time differences. The following should be a decent start:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
if '**' in line:
# Get the data between the asterisks
if 'SR' in line:
sr_numbers.append(re.sub(pattern,"\\2", line.strip()))
else:
dates.append(datetime.strptime(re.sub(pattern,"\\2", line.strip()), '%a %b %d %H:%M:%S %Y'))
else:
continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the order of the test number along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
Edit:
Parsing the file without relying on the asterisks around the dates:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
if 'SR' in line:
current_sr_number = re.sub(pattern,"\\2", line.strip())
sr_numbers.append(current_sr_number)
elif line.strip().count(":") > 1:
try:
dates.append(datetime.strptime(re.split("\s{3,}",line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
except IndexError:
#print(re.split("\s{3,}",line))
dates.append(datetime.strptime(re.split("\t+",line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
else:
continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the order of the test number along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
I hope this proves useful.

How optimize word counting in Python?

I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100 % of my CPU (iCore5, 8GB RAM, macbook air 2014) and ran for 14 hours before I shut the process down. How can I speed the looping and counting up?
I have created a corpus in NLTK out of three Swedish UTF-8 formatted, tab-separated files Swe_Newspapers.txt, Swe_Blogs.txt, Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text-file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text-file I want to analyse (Swe_Blogs.txt) has this structure, and works fine to parse:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to produce the counter as below, does not work, but can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus(categories=["Blogs"] to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115,7 MB) that my computer get's a little tired. How can I speed up the following code? It seems to work, no error messages, but it takes >14h to execute.
import collections
counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
else:
continue
Any help to improve my coding skills is much appreciated!
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
This would be much faster as:
words = set(my_wordlist.words()) # call once, make set for fast check
for word in mycorp.words(categories="Blogs"):
if word in words:
counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
if word in words)
You already got good answers on how to count words properly with Python. The problem is that it will still be quite slow. If you are just exploring the corpora, using a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the first 100 tokens in descending order:
cat Swe_Blogs.txt | cut --delimiter='\t' --fields=5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100

Extracting columnar data correctly as it is in the file

Suppose i have tabular column as below.Now i want to extract the column wise data.I tried extracting data by creating a list.But it is extracting the first row correctly but from second row onwards there is space i.e under CEN/4.Now my code considers zeroth column has 5.0001e-1 form second row,it starts reading from there. How to extract the data correctly coulmn wise.output is scrambled.
0 1 25 CEN/4 -5.000000E-01 -3.607026E+04 -5.747796E+03 -8.912796E+02 -88.3178
5.000000E-01 3.607026E+04 5.747796E+03 8.912796E+02 1.6822
27 -5.000000E-01 -3.641444E+04 -5.783247E+03 -8.912796E+02 -88.3347
5.000000E-01 3.641444E+04 5.783247E+03 8.912796E+02 1.6653
28 -5.000000E-01 -3.641444E+04 -5.712346E+03 -8.912796E+02 -88.3386
5.000000E-01 3.641444E+04 5.712346E+03 8.912796E+02
my code is :
f1=open('newdata1.txt','w')
L = []
for index, line in enumerate(open('Trial_1.txt','r')):
#print index
if index < 0: #skip first 5 lines
continue
else:
line =line.split()
L.append('%s\t%s\t %s\n' %(line[0], line[1],line[2]))
f1.writelines(L)
f1.close()
my output looks like this:
0 1 CEN/4 -5.000000E-01 -5.120107E+04
5.000000E-01 5.120107E+04 1.028093E+04 5.979930E+03 8.1461
i want columnar data as it is in the file.How to do that.I am a bgeinner
its hard to tell from the way the input data is presented in your question, but Im guessing your file is using tabs to separate columns, in any case, consider using python csv module with the relevant delimiter like:
import csv
with open('input.csv') as f_in, open('newdata1', 'w') as f_out:
reader = csv.reader(f_in, delimiter='\t')
writer = csv.writer(f_out, delimiter='\t')
for row in reader:
writer.writerow(row)
see python csv module documentation for further details

Send output from iPython console to .csv file. (& viewing data issue)

Using the iPython console, I built a pandas dataframe called df.
for (k1,k2), group in df.groupby(['II','time']):
print k1,k2
print group
df['II'] stores integers between: [-10,10].
'time' can be either 930 or 1620
My goal is to save the output (of this loop) to a single .csv file. (Not great, but I copied and pasted the output to a csv. However, in doing so, I noticed that "II"== -1, at both times: 930/1620, do not appear in (full data view) like the others. (They both exist, though).
For example, for "II"== -1 # 930 it appears in the console as :
-1 930
<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 2 to 2140
Data columns:
index 268 non-null values
date 268 non-null values
time 268 non-null values
price 268 non-null values
round5 268 non-null values
II 268 non-null values
Pattern 268 non-null values
pl 268 non-null values
dtypes: float64(2), int64(4), object(2)
With the knowledge that it exists, I tried brute force, pulling them manually:
u=df['II']== -1
one=df.groupby('time')[u]
#To check the result:
one.to_csv('file.csv')
I'm grouping by 'time', so all times should appear. Yet the resulting csv only contains the 1620 times--all results at 930 are, unfortunately, missing in action. It's bizarre. Your suggestions greatly appreciated.