Format of google-perftools/pprof with heap profiling

The google-perftools package includes a pprof utility. It converts profile files produced by the google-perftools cpuprofiler and heapprofiler into images such as https://github.com/gperftools/gperftools/tree/master/doc/pprof-test-big.gif and
https://github.com/gperftools/gperftools/tree/master/doc/heap-example1.png
The format of pprof's input file is described for CPU profiles here: https://github.com/gperftools/gperftools/tree/master/doc/cpuprofile-fileformat.html
but the format of heap profile input files is not described in the svn repository.
What is the heap profiling format, and how can I generate such a file from my code? I can already generate the cpuprofiler format, so I am interested in the differences between the two formats.

The format does not seem to be binary, as it is for the CPU profiler, but textual:
First line:
heap profile: 1: 2 [ 3: 4] @ heapprofile
Regex (not complete):
(\d+): (\d+) \[ *(\d+): (\d+)\]( @ ([^/]*)(/(\d+))?)?
where
1 and 2 are the "in-use" stats: the first number is the count of allocations, the second is the byte count
3 and 4 are the "total allocated" stats, with the same meaning for the first and second numbers
heapprofile is the type
The profile itself follows, one record per line:
1: 2 [ 3: 4] @ 0x00001 0x00002 0x00003
where "1: 2" and "3: 4" have the same meaning as in the first line, but count only memory allocated from the given call site; 0x00001 0x00002 0x00003 is the call stack of the call site.
Then comes an empty line and "MAPPED_LIBRARIES:". Starting on the next line, something very much like a copy of /proc/pid/maps follows.
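The layout described above can be checked with a small parser. This is only a sketch based on the description in this answer, not on the pprof source: the function name parse_heap_profile and the dictionary keys are made up for illustration, and it assumes the separator before the profile type and before the call stack is "@", as in files written by gperftools.

```python
import re

# Header: "heap profile: <in-use count>: <in-use bytes> [ <total count>: <total bytes>] @ <type>[/<period>]"
HEADER_RE = re.compile(
    r"heap profile:\s*(\d+):\s*(\d+)\s*\[\s*(\d+):\s*(\d+)\]\s*@\s*([^/\s]*)(?:/(\d+))?"
)
# Record: same four counters, then "@" and the call stack as hex addresses.
RECORD_RE = re.compile(
    r"\s*(\d+):\s*(\d+)\s*\[\s*(\d+):\s*(\d+)\]\s*@((?:\s+0x[0-9a-fA-F]+)*)"
)


def parse_heap_profile(text):
    """Return (header, records) parsed from the text of a heap profile."""
    lines = text.splitlines()
    m = HEADER_RE.match(lines[0]) if lines else None
    if m is None:
        raise ValueError("not a heap profile header")
    header = {
        "in_use_count": int(m.group(1)),
        "in_use_bytes": int(m.group(2)),
        "total_count": int(m.group(3)),
        "total_bytes": int(m.group(4)),
        "type": m.group(5),
    }
    records = []
    for line in lines[1:]:
        if line.startswith("MAPPED_LIBRARIES:"):
            break  # the rest is the /proc/pid/maps-like section
        m = RECORD_RE.match(line)
        if m is None:
            continue  # skip empty or unrecognized lines
        records.append({
            "in_use_count": int(m.group(1)),
            "in_use_bytes": int(m.group(2)),
            "total_count": int(m.group(3)),
            "total_bytes": int(m.group(4)),
            "stack": [int(addr, 16) for addr in m.group(5).split()],
        })
    return header, records
```

Running it on a hand-written file with one header line, one record and a MAPPED_LIBRARIES section should give one header dictionary and one record whose stack is a list of integers.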

Related

Python: Match a special character with regular expression

Hi everyone, I'm using the re.match function to extract pieces of a string from each line of a file.
My code is as follows:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        cpuOverall=re.match(r"(Overall CPU load average)\s+(\S+)(%)",x)
        cpuUsed=re.match(r"(Total)\s+(\d+)(%)",x)
        ramUsed=re.match(r"(RAM Utilization)\s+(\d+\%)",x)
        ####Not Work####
        if cpuUsed is not None: cpuused_new=cpuUsed.group(2)
        if ramUsed is not None: ramused_new=ramUsed.group(2)
        if cpuOverall is not None: cpuoverall_new=cpuOverall.group(2)
    except:
        searchbox_result = None
Each field is extracted from the corresponding line below:
ramUsed => RAM Utilization 2%
cpuUsed => Total 4%
cpuOverall => Overall CPU load average 12%
ramUsed, cpuUsed and cpuOverall are the variables where I want to store the results.
The actual lines in the file are:
(undefined amount of leading space) RAM Utilization 2%
(undefined amount of leading space) Total 4%
(undefined amount of leading space) Overall CPU load average 12%
When I run the script, all three variables are None.
With other variables the script works correctly.
Why doesn't the code work in this case? I'm using Python 3.
I think the problem is that the % character is not being read.
Do you have any suggestions?
PROBLEM 2:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        emailReceived=re.match(r".*(Messages Received)\s+\S+\s+\S+\s+(\S+)",x)
        ####Not Work####
        if emailReceived is not None: emailreceived_new=emailReceived.group(2)
    except:
        searchbox_result = None
The field should be extracted from the following text, where the matching line appears in two places in the file:
[....]
Counters: Reset Uptime Lifetime
Receiving
Messages Received 3,406 1,558 3,406
[....]
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0
Recipients Received 0 0 0
[....]
I want to extract only the second occurrence, that is:
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0 <-this
Do you have any suggestions?
cpuOverall line: you forgot that there is more text at the start of the line. Change the pattern to
'.*(Overall CPU load average)\s+(\S+%)'
cpuUsed line: the same problem applies here. Change the pattern to
'.*(Total)\s+(\d+%)'
ramUsed line: again, the same problem. Change the pattern to
'.*(RAM Utilization)\s+(\d+%)'
Remember that re.match only matches at the beginning of the string:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. [..]
With these changes, your three variables are set to the percentages:
>>> print (cpuused_new,ramused_new,cpuoverall_new)
4% 2% 12%
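An alternative to prefixing every pattern with .* is re.search, which scans the whole line instead of anchoring at its start. The sketch below shows that approach. It also includes one possible way to handle Problem 2, which the answer above does not cover, by ignoring "Messages Received" lines until the "Rates (Events Per Hour)" header has been seen. The function names and dictionary keys are made up for illustration.

```python
import re

# One pattern per field; group(1) is the percentage including the % sign.
PERCENT_PATTERNS = {
    "ram_used": re.compile(r"RAM Utilization\s+(\d+%)"),
    "cpu_used": re.compile(r"Total\s+(\d+%)"),
    "cpu_overall": re.compile(r"Overall CPU load average\s+(\S+%)"),
}


def extract_percentages(lines):
    """Return a dict with the last percentage found for each field."""
    found = {}
    for line in lines:
        for name, pattern in PERCENT_PATTERNS.items():
            # search() finds the pattern anywhere in the line,
            # so leading whitespace or other text does not matter.
            m = pattern.search(line)
            if m:
                found[name] = m.group(1)
    return found


def messages_received_in_rates(lines, section="Rates (Events Per Hour)"):
    """Return the three columns of 'Messages Received' inside the given section."""
    in_section = False
    for line in lines:
        if line.lstrip().startswith(section):
            in_section = True
        elif in_section:
            m = re.search(r"Messages Received\s+(\S+)\s+(\S+)\s+(\S+)", line)
            if m:
                return m.groups()
    return None
```

Both functions accept any iterable of lines, so an open file object such as fp_tmp can be passed directly.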

Get a string after a specific word, using a program that has limited regex features?

Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details in entries like the ones below. The goal is to pull out 100% Silk or 100% Velvet. This is the material of the product, and it always comes right after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
How do I capture just the desired string? Let me know if you have any tips! Thank you!
It is difficult to provide a working solution for your situation, since you mention that your program has "limited regex features" but don't explain what the limitations are.
Here is a regex you can try to work with to capture the target string:
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
import re
# re.MULTILINE lets $ match at the end of each line, not only at the end of the text
matches = re.findall('(?<=Details<).*$', text, flags=re.MULTILINE)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
# Non-greedy quantifiers stop at the first tags after "Details"; the replacement keeps the word "Details"
text = re.sub('Details<.*?<.*?>', 'Details', text)
matches = re.findall('(?<=Details).*?(?=<)', text)
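Both approaches can also be collapsed into a single pattern with one capture group. The sketch below assumes the markup always has the shape <h3>Details</h3> followed by a <p> element, as in the two samples from the question. The name extract_material is made up for illustration. Because the capture is non-greedy, it stops at the first </p> even when the dot matches newlines, which is the mode the question describes.

```python
import re

# Capture the text of the first <p> element that follows the Details heading.
# re.DOTALL mimics the "dot matches newline" behaviour described in the question.
DETAILS_RE = re.compile(r"<h3>Details</h3>\s*<p>(.*?)</p>", re.DOTALL)


def extract_material(html):
    """Return the material text after the Details heading, or None if absent."""
    m = DETAILS_RE.search(html)
    if m is None:
        return None
    # Drop surrounding whitespace and the trailing full stop.
    return m.group(1).strip().rstrip(".")
```

In a tool that only supports search and replace, the same idea would be to match everything before and after the captured group and replace the whole match with the group, though whether that works depends on the limitations of the tool.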

In python insert one space after every 5th Character in each line of a text file

I am reading a text file in Python (500 rows) that looks like this:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
Is it possible to insert one space after the 5th character in each line?
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the code below:
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but it returned the following output:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
The main issue here is that, judging by the output of your code, you are reading the whole file into one single string. In your case it would be much better to read the file as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []
# Iterate through your file to read the data
with open("input_data.txt") as f:
    for line in f.readlines():
        # Use .rstrip() to get rid of the newline character at the end
        data.append(line.rstrip("\r\n"))
Then, to operate on the data you obtained in a list, you could use a list comprehension similar to the one you have tried to use.
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
Hope this helped!
If your only requirement is to insert a space after the fifth character, then you could use the following simple version:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        if len(line) > 5:
            print(line[0:5]+" "+line[5:])
        else:
            print(line)
If you don't mind lines with five or fewer characters getting a space at the end, you could even omit the if-else statement and keep only the print call from the if clause:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        print(line[0:5]+" "+line[5:])
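The one-liner from the question fails because it cuts the whole file content, newline characters included, into 5-character chunks, so after the first line the chunk boundaries drift away from the line starts. A minimal sketch that works per line instead, with made-up function names, could look like this:

```python
def insert_space(line, position=5):
    """Insert one space after the first `position` characters of a single line."""
    if len(line) <= position:
        return line  # nothing to split, leave short lines untouched
    return line[:position] + " " + line[position:]


def insert_space_in_text(text, position=5):
    """Apply insert_space to every line of a multi-line string."""
    return "\n".join(insert_space(line, position) for line in text.splitlines())
```

For example, insert_space_in_text(open("input_data.txt").read()) would process the whole file in one call, assuming the file fits in memory.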

bash/sed script to get output from a file using a regex

I have a file which contains lines of the form object 0: data: 2, object 0: data: 232132 in between other lines. I need to extract the data values for all objects i and store them, space separated, in an output file (say, output) using bash or sed. It would be great if someone could help me achieve this.
Example input:
num objects: 3
object 0: name: 'x'
object 0: size: 4
object 0: data: 1
object 1: name: 'y'
object 1: size: 4
object 1: data: 3231
object 2: name: 'x'
object 2: size: 4
object 3: data: -32
Example output:
1 3231 -32
You could use something like this:
awk '$3=="data:"{print $4}' file
This outputs the 4th field when the 3rd field is equal to "data:".
Shorter still, you could just match the pattern /data:/:
awk '/data:/{print $4}' file
To output the numbers on the same line, use printf rather than print. To keep things cleaner, you can use an array and print the values in the END block:
awk '/data:/{a[++n]=$4}END{for(i=1;i<=n;++i)printf "%s%s",a[i],(i<n?FS:RS)}' file
Using an array like this makes it easy to separate each value with a space FS and add a newline RS at the end.
Any of these commands can produce an output file using redirection > output.
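For comparison, here is the same field-based logic as the first awk command written in Python. The question asked for bash or sed, so this is only an equivalent sketch; the function name is made up for illustration.

```python
def extract_data_values(lines):
    """Collect the 4th field of every line whose 3rd field is 'data:'."""
    values = []
    for line in lines:
        fields = line.split()  # split on whitespace, like awk's default FS
        if len(fields) >= 4 and fields[2] == "data:":
            values.append(fields[3])
    return " ".join(values)
```

Passing an open file object works too, for example extract_data_values(open("file")), and the returned string can then be written to the output file.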

Sampling random lines from a CSV

I'm working with a large CSV. How can I take a random sample of rows (say, 200 in total) and recombine them into a CSV with the same structure as the original?
The procedure I would use is as follows:
Generate 200 unique numbers between 0 and the number of lines in the CSV file.
Read each line of the CSV file and keep track of which line number you are reading. If the line number matches one of the numbers above, output the line.
Use reservoir sampling, a random sampling technique that requires neither all records to be in memory nor the number of records to be known in advance. With it, you stream in your records one by one and probabilistically select them into the sample. Once the stream is exhausted, you output the final sample records. The technique guarantees that each record in the stream has the same probability of being in the final sample. That is to say, it generates a simple random sample.
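A minimal sketch of reservoir sampling for a CSV file follows. It assumes the first row is a header that should be copied to the output unchanged. The function name, the seed parameter and the file paths are made up for illustration.

```python
import csv
import random


def reservoir_sample_csv(src_path, dst_path, k=200, seed=None):
    """Write a uniform random sample of k data rows from src_path to dst_path."""
    rng = random.Random(seed)
    sample = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)  # assumed header row, kept out of the sample
        for i, row in enumerate(reader):
            if i < k:
                # Fill the reservoir with the first k rows.
                sample.append(row)
            else:
                # Replace a random slot with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        writer.writerows(sample)
    return len(sample)
```

Only k rows are held in memory at any time, and the file is read exactly once. The sampled rows are not guaranteed to come out in their original order.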
You can use the random module's random.sample function to pick random entries from a list of line offsets, as shown below.
import random
# Fetching line offsets.
# Courtesy: Adam Rosenfield's tip about how to read a HUGE text file.
# http://stackoverflow.com/questions/620367/
# Read in the file once and build a list of line offsets.
# Binary mode makes len(line) a byte count, which is what seek() expects.
line_offset = []
offset = 0
with open('your_file', 'rb') as f:
    for line in f:
        line_offset.append(offset)
        offset += len(line)
# Part where you pick the random lines and copy to your new file
# My 2 cents.
randoffsets = random.sample(line_offset, 200)
with open('your_file', 'rb') as f:
    for k in randoffsets:
        f.seek(k)
        line = f.readline()  # append this line to your new file
You could try to use linecache if it works for you, but since linecache reads the entire file into memory, I'm not sure how well it would work for a 6GB file.