On a Windows 10 machine, I seem to be running into substantially increased write times on our cache files.
Below I have included timing operations for our writes with/without Defender's intervention. For this test, we are writing 32KB blocks to a 1GB, pre-allocated cache file, 36000 times.
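For reference, here is a rough sketch of the kind of write loop being timed. The real code uses the Windows C++ file API, so this Python version is only an illustration, and the file path and block contents are placeholders:

import time

PATH = r"C:\cache\test.cache"          # hypothetical cache file path
BLOCK = b"\0" * (32 * 1024)            # one 32KB block
FILE_SIZE = 1024 * 1024 * 1024         # 1GB, pre-allocated
WRITES = 36000

# Pre-allocate the 1GB cache file.
with open(PATH, "wb") as f:
    f.truncate(FILE_SIZE)

start = time.perf_counter()
with open(PATH, "r+b", buffering=0) as f:
    for i in range(WRITES):
        f.seek((i * len(BLOCK)) % FILE_SIZE)   # wrap around inside the 1GB file
        f.write(BLOCK)
print("total write time: %.3f s" % (time.perf_counter() - start))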
Here are file write times with Windows Defender enabled (default behavior on machines):
### Manager: [CacheFile] --> 123.524 secs
### Count: 36784. Time: 123524(ms). Average: 3(ms). Max Finished: 218(ms).
### Unfinished: 0. Max Unfinished: 0(ms). Min Unfinished: 0(ms).
### Max Finished Item: [Name: [DirectFileWrite:4294967293]. Pid: 0x00000000000002E8. Tid: 0x00000000000010A0. Data: 0x0000000000000000.].
### Max Unfinished Item: [].
### Min Unfinished Item: [].
### Reporting Time: 0(ms).
And here are the same operations performed when our cache file is added to Windows Defender's exclusion list:
### Manager: [CacheFile] --> 9.194 secs
### Count: 36784. Time: 9194(ms). Average: 0(ms). Max Finished: 126(ms).
### Unfinished: 0. Max Unfinished: 0(ms). Min Unfinished: 0(ms).
### Max Finished Item: [Name: [DirectFileWrite:4294967293]. Pid: 0x00000000000006F4. Tid: 0x000000000000130C. Data: 0x0000000000000000.].
### Max Unfinished Item: [].
### Min Unfinished Item: [].
### Reporting Time: 0(ms).
#########
I'm thinking that Windows Defender is running some sort of check on the entire file (opening and scanning the full 1GB of data) every time we write to it.
Adding the cache files to the exclusion list would be a last resort, so I'm wondering whether anyone has run into issues of a similar nature?
I'm using the Windows C++ API for all I/O operations.
We have a graph that contains both customer and product vertices. For a given product, we want to find out how many customers who signed up before DATE have purchased this product. My query looks something like this:
g.V('PRODUCT_GUID') // get product vertex
.out('product-customer') // get all customers who ever bought this product
.has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))) // see if the customer was created after a given date
.count() // count the results
This query is incredibly slow, so I looked at the Neptune profiler and saw something odd. Below is the full profiler output. Ignore the elapsed time in the profiler; this was after many attempts at the same query, so the cache is warm. In the wild, it can take 45 seconds or more.
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V('PRODUCT_GUID').out('product-customer').has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))).count()
Original Traversal
==================
[GraphStep(vertex,[PRODUCT_GUID]), VertexStep(OUT,[product-customer],vertex), HasStep([created_on.gte(Sat Nov 28 00:33:44 UTC 2020)]), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneCountGlobalStep {
JoinGroupNode {
PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586, indexTime=0, joinTime=14, numSearches=1, actualTotalOutput=13424}
PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574, indexTime=10, joinTime=140, numSearches=13424}
}, annotations={path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep], joinStats=true, optimizationTime=0, maxVarId=8, executionTime=165}
}
]
Physical Pipeline
=================
NeptuneCountGlobalStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
Runtime (ms)
============
Query Execution: 164.996
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneCountGlobalStep 1 1 164.919 100.00
>TOTAL - - 164.919 -
Predicates
==========
# of predicates: 131
Results
=======
Count: 1
Output: [22]
Index Operations
================
Query execution:
# of statement index ops: 13425
# of unique statement index ops: 13425
Duplication ratio: 1.0
# of terms materialized: 0
In particular
DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
This line surprises me. The way I'm reading it is that Neptune is ignoring the vertices coming from ".out('product-customer')" when satisfying the ".has('created_on'...)" requirement, and is instead joining on every single customer vertex that has the created_on attribute.
I would have expected that the cardinality is only the number of customers with an edge from the product, not every single customer.
I'm wondering if there's a way to only run this comparison on the customers coming from the "out('product-customer')" step.
Neptune actually must solve the first pattern,
(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6)
before it can solve the second,
(?3, <created_on>, ?7, ?)
Each quad pattern is an indexed lookup bound by at least two fields. So the first lookup uses the SPOG index in Neptune bound by the Subject (the ID) and the Predicate (the edge label). This will return a set of Objects (the vertex IDs for the vertices at the other end of the product-customer edges) and references them via the ?3 variable for the next pattern.
In the next pattern those vertex IDs (?3) are bound with the Predicate (the property key created_on) to evaluate the condition of the date range. Because this is a conditional evaluation, each vertex in the ?3 set has to be evaluated (the created_on property on each of those vertices has to be read).
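To make that concrete, here is a toy Python sketch of the two lookups. The dicts stand in for Neptune's indexes and all of the data is made up, but it shows why the work done by the second pattern is proportional to the 13,424 customers returned by the first pattern (numSearches=13424), not to the 1,285,574 in estimatedCardinality, which appears to be only the optimizer's estimate over all created_on statements:

# Toy illustration of the two quad-pattern lookups described above.
# The dicts stand in for Neptune's indexes; names and data are made up.
from datetime import datetime, timezone

spog_index = {            # (subject, predicate) -> objects: product -> customer IDs
    ("PRODUCT_GUID", "product-customer"): ["c1", "c2", "c3"],
}
property_index = {        # (subject, property key) -> value
    ("c1", "created_on"): datetime(2020, 10, 1, tzinfo=timezone.utc),
    ("c2", "created_on"): datetime(2020, 12, 5, tzinfo=timezone.utc),
    ("c3", "created_on"): datetime(2021, 1, 15, tzinfo=timezone.utc),
}

cutoff = datetime(2020, 11, 28, 0, 33, 44, tzinfo=timezone.utc)

# Pattern 1: one indexed lookup, bound by Subject + Predicate, yields the ?3 bindings.
customers = spog_index[("PRODUCT_GUID", "product-customer")]

# Pattern 2: one indexed lookup per ?3 binding (numSearches == len(customers)),
# evaluating the date condition only for the customers of this product.
count = sum(1 for c in customers if property_index[(c, "created_on")] >= cutoff)
print(count)  # 2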
Hi everyone, I'm using the re.match function to extract pieces of a string from each row of a file.
My code is as follows:
## fp_tmp => file object for the input file (opened elsewhere)
import re

for x in fp_tmp:
    try:
        cpuOverall = re.match(r"(Overall CPU load average)\s+(\S+)(%)", x)
        cpuUsed = re.match(r"(Total)\s+(\d+)(%)", x)
        ramUsed = re.match(r"(RAM Utilization)\s+(\d+\%)", x)
        #### Not working ####
        if cpuUsed is not None: cpuused_new = cpuUsed.group(2)
        if ramUsed is not None: ramused_new = ramUsed.group(2)
        if cpuOverall is not None: cpuoverall_new = cpuOverall.group(2)
    except:
        searchbox_result = None
Each field is extracted from the following corresponding line:
ramUsed => RAM Utilization 2%
cpuUsed => Total 4%
cpuOverall => Overall CPU load average 12%
ramUsed, cpuUsed and cpuOverall are the variables where I want to write the results.
The actual lines in the file look like this (the amount of leading space is undefined):
(space undefined) RAM Utilization 2%
(space undefined) Total 4%
(space undefined) Overall CPU load average 12%
When I execute the script, all of these variables end up as None.
With other variables the script works correctly.
Why doesn't the code work in this case? I am using Python 3.
I think the problem is the % character, which is not being matched.
Do you have any suggestions?
PROBLEM 2:
## fp_tmp => file object for the input file (opened elsewhere)
for x in fp_tmp:
    try:
        emailReceived = re.match(r".*(Messages Received)\s+\S+\s+\S+\s+(\S+)", x)
        #### Not working ####
        if emailReceived is not None: emailreceived_new = emailReceived.group(2)
    except:
        searchbox_result = None
The field appears in two places in the file, like this:
[....]
Counters: Reset Uptime Lifetime
Receiving
Messages Received 3,406 1,558 3,406
[....]
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0
Recipients Received 0 0 0
[....]
I want to extract only the second occurrence, i.e. this one:
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0 <-this
Do you have any suggestions?
cpuOverall line: you forgot that there is more information at the start of the line. Change to
'.*(Overall CPU load average)\s+(\S+%)'
cpuUsed line: you forgot that there is more information at the start of the line. Change to
'.*(Total)\s+(\d+%)'
ramUsed line: you forgot that there is more information at the start of the line... Change to
'.*(RAM Utilization)\s+(\d+%)'
Remember that re.match looks for an exact match from the start:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. [..]
With these changes, your three variables are set to the percentages:
>>> print (cpuused_new,ramused_new,cpuoverall_new)
4% 2% 12%
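Putting the corrected patterns together on the sample lines (with some leading whitespace added, since the exact amount is undefined), a quick self-contained check:

import re

lines = [
    "    RAM Utilization 2%",
    "    Total 4%",
    "    Overall CPU load average 12%",
]

cpuoverall_new = cpuused_new = ramused_new = None
for x in lines:
    cpuOverall = re.match(r".*(Overall CPU load average)\s+(\S+%)", x)
    cpuUsed = re.match(r".*(Total)\s+(\d+%)", x)
    ramUsed = re.match(r".*(RAM Utilization)\s+(\d+%)", x)
    if cpuOverall: cpuoverall_new = cpuOverall.group(2)
    if cpuUsed: cpuused_new = cpuUsed.group(2)
    if ramUsed: ramused_new = ramUsed.group(2)

print(cpuused_new, ramused_new, cpuoverall_new)  # 4% 2% 12%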
I am creating a GUI using Tkinter.
In the code below, I am using two methods to place labels on the GUI's screen.
In method 1, I initialise the labels once, and then use label_place and label_forget to place the labels on different screens.
In method 2, I attach a StringVar to each label's textvariable option, and use set() to set the text at different positions.
I would like to know which method is more efficient.
With method 1, on every page I have to call the forget method and then re-place the labels.
With method 2, I call string_var0[i].set("text") to show text and string_var0[i].set("") to clear it.
#! /usr/bin/python
from Tkinter import *

class Gui(Frame):
    def __init__(self, parent):
        Frame.__init__(self, parent)
        self.parent = parent
        self.data_Frame()
        Label_Tuple = ["a:", "b:", "b:", "d:", "a:", "b:", "b:", "d:", "a:", "b:", "b:", "d:"]
        pos = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
        # Method 1: plain labels placed/forgotten with place()/place_forget().
        self.label_init()
        self.label_place(Label_Tuple, pos)
        #self.label_forget()
        # Method 2: labels bound to StringVars that are set()/cleared.
        #self.Data_Frame()
        #self.widgets_place(Label_Tuple)
        #self.widgets_forget()

    def data_Frame(self):
        self.data_frame = Frame(self.parent, bg="cyan")
        self.data_frame.place(x=10, y=55, anchor=NW, height=370, width=510)
        self.data_frame.config(highlightbackground='black', highlightthickness='0')

    def label_init(self):
        self.label = []
        for i in range(0, 12):
            self.label.append(Label(self.data_frame, width=14, bg="cyan", anchor=NE, font=("Helvetica", 15)))

    def label_place(self, Label, position):
        self.place_x = 13
        self.place_y = 5
        self.incr = 30
        j = 0
        for i, label in enumerate(Label):
            while True:
                if position[i] == j:
                    self.label[i].config(text=Label[i])
                    self.label[i].place(x=self.place_x, y=self.place_y)
                    break
                else:
                    self.place_y += self.incr
                    j += 1

    def label_forget(self):
        for i in range(0, 12):
            self.label[i].place_forget()

    def Data_Frame(self):
        self.Data_Frame = Frame(self.parent, bg="cyan")
        self.Data_Frame.place(x=10, y=55, anchor=NW, height=370, width=510)
        self.Data_Frame.config(highlightbackground='black', highlightthickness='0')
        self.label = []
        place_x = 13
        place_y = 10
        incr = 30
        self.string_var0 = []
        for i in range(0, 12):
            variable = StringVar()
            self.label.append(Label(self.Data_Frame, width=14, textvariable=variable, bg="red", anchor=NW, font=("Helvetica", 15)))
            self.string_var0.append(variable)
            self.label[i].place(x=place_x, y=place_y)
            place_y += incr

    def widgets_place(self, Label_Tuple):
        for i in range(0, 12):
            self.string_var0[i].set(Label_Tuple[i])

    def widgets_forget(self):
        for i in range(0, 12):
            self.string_var0[i].set("")


if __name__ == "__main__":
    root = Tk()
    #root.attributes('-fullscreen', True)
    root.geometry("750x490")
    #root.config(cursor="none")
    #bigfont = tkFont.Font(family="Helvetica", size=15, slant="roman")
    #root.option_add("*TCombobox*Listbox*Font", bigfont)
    app = Gui(root)
    app.mainloop()
How can I check the execution speed of this Python code (the execution time taken) and the memory it occupies?
I use the OS's time command. Note that it gives you all three: system time, user time, and wall-clock time.
/usr/bin/time -v program_name.py
You can also use top or htop to view memory usage, but usage will always be changing when the program is running. There is also ps and smaps but I have not used either.
The results look something like the following (note that modern systems use as much of the CPU as is available and give up some of it when necessary; I am on a multi-core system, so the program had all of one core to itself).
Command being timed: "./test_r_1.py"
User time (seconds): 0.54
System time (seconds): 0.28
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.82
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 51248
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 37484
Voluntary context switches: 57
Involuntary context switches: 739
Swaps: 0
File system inputs: 0
File system outputs: 176
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
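If you would rather measure from inside the script, the standard library can produce roughly the same numbers. Here is a minimal sketch (the resource module is Unix-only, and the summed loop is just a stand-in for the code being measured):

import time
import resource   # Unix-only; not available on Windows

start_wall = time.perf_counter()
start_cpu = time.process_time()

# ... the code you want to measure ...
total = sum(i * i for i in range(10**6))

print("wall clock: %.3f s" % (time.perf_counter() - start_wall))
print("cpu time:   %.3f s" % (time.process_time() - start_cpu))
# ru_maxrss is in kilobytes on Linux (bytes on macOS)
print("max RSS:    %d" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)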
I am using the optimizer in PyAlgoTrade to run my strategy and find the best parameters. The message I get is this:
2015-04-09 19:33:35,545 broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1681] for 888 share/s
2015-04-09 19:33:35,546 broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1684] for 998 share/s
2015-04-09 19:33:35,547 server [INFO] Partial result 7160083.45 with parameters: ('600800', 4, 19) from worker-16216
2015-04-09 19:33:36,049 server [INFO] Best final result 7160083.45 with parameters: ('600800', 4, 19) from client worker-16216
This is just part of the output. You can see that only the parameters ('600800', 4, 19) produce a result; for the other combinations of parameters I just get messages like: broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1684] for 998 share/s.
I think this message means that I have created a buy order but do not have enough cash to fill it. However, here is my script:
shares = self.getBroker().getShares(self.__instrument)
if bars[self.__instrument].getPrice() > up and shares == 0:
    sharesToBuy = int(self.getBroker().getCash() / bars[self.__instrument].getPrice())
    self.marketOrder(self.__instrument, sharesToBuy)
if shares != 0 and bars[self.__instrument].getPrice() > up_stop:
    self.marketOrder(self.__instrument, -1 * shares)
if shares != 0 and bars[self.__instrument].getPrice() < up:
    self.marketOrder(self.__instrument, -1 * shares)
The logic of my strategy is that if the current price is larger than up, we buy, and if the current price is larger than up_stop or smaller than up after we have bought, we sell. So from the code, there should be no way I can generate an order that I do not have enough cash to pay for, because the order size is calculated from my current cash.
So where am I going wrong?
You calculate the order size based on the current price, but the price for the next bar may have gone up. The order is not filled in the current bar, but starting from the next bar.
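One common way to deal with this (not the only one) is to size the order against only a fraction of the available cash, so that a somewhat higher price on the next bar can still be paid. A sketch based on the buy branch above; the 5% margin is an arbitrary choice:

shares = self.getBroker().getShares(self.__instrument)
price = bars[self.__instrument].getPrice()
if price > up and shares == 0:
    # Leave ~5% headroom so the (possibly higher) fill price on the next bar is still affordable.
    sharesToBuy = int((self.getBroker().getCash() * 0.95) / price)
    if sharesToBuy > 0:
        self.marketOrder(self.__instrument, sharesToBuy)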
With respect to the 'Partial result' and 'Best final result' messages, how many combinations of parameters are you trying? Note that if you are using 10 different combinations, you won't get 10 different 'Partial result' lines, because the combinations are evaluated in batches of 200 and only the best partial result for each batch of 200 combinations gets printed.
I'm using Hive and I'm encountering an exception when I'm performing a query with a custom InputFormat.
When I use the query select * from micmiu_blog; Hive works without problems, but if I use select * from micmiu_blog where 1=1; it seems that the framework cannot find my custom InputFormat class.
I have put the JAR file into "hive/lib" and "hadoop/lib", and I have also added "hadoop/lib" to the CLASSPATH. This is the log:
hive> select * from micmiu_blog where 1=1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1415530028127_0004, Tracking URL = http://hadoop01-master:8088/proxy/application_1415530028127_0004/
Kill Command = /home/hduser/hadoop-2.2.0/bin/hadoop job -kill job_1415530028127_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-11-09 19:53:32,996 Stage-1 map = 0%, reduce = 0%
2014-11-09 19:53:52,010 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1415530028127_0004 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1415530028127_0004_m_000000 (and more) from job job_1415530028127_0004
Task with the most failures(4):
-----
Task ID:
task_1415530028127_0004_m_000000
URL:
http://hadoop01-master:8088/taskdetails.jsp?jobid=job_1415530028127_0004&tipid=task_1415530028127_0004_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: cannot find class hiveinput.MyDemoInputFormat
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:564)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:167)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:408)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I ran into this problem just now. You should add your JAR file to the classpath from the Hive CLI.
You can do it like this:
hive> add jar /usr/lib/xxx.jar;
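If you need the class available in every session, it may also be worth making the JAR permanently visible to Hive (for example via the hive.aux.jars.path property or an auxlib directory), but running ADD JAR in the CLI before the query is usually enough.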