I have an XML file which I have read in and indexed in my editor. Records are denoted by one of three statuses in the XML file - GOOD, AVERAGE, POOR; and they are also indexed by a date in DD/MM/YYYY format.
What I have to do is group them initially by month (so stripping the month out of the date, which I have successfully done) and then loop through again and count the 'GOOD' statuses on a per-month basis.
I would assume that I would have to group by both month and status and then count the occurrences of 'GOOD', although I seem to be getting stuck somewhere... what I have developed so far is below; hopefully I just need to tweak it slightly!
...data fed in from xml...
data  (doall (get-records (xml/parse is)))
loopA (for [flop data]
        {:STATUS (:QUALITY_STATUS flop)
         :MONTH  (get (split (:DATE flop) #"/") 1)})
x     (group-by #(select-keys % [:MONTH :STATUS]) loopA)
loopB (str-join \newline
        (for [floop x]
          {:MONTH_STAMP (second floop)
           :Q_STATUS    (if (= :STATUS "GOOD") (second floop))
           :GOOD_QUAL   (count :Q_STATUS)}))
Ideally what I want to end up with is three columns of data like the below:
Month -- Quality -- Count
01 ------- GOOD ----- 1
02 ------- GOOD ----- 5
...
12 ------- GOOD ----- 3
Thanks in advance! :)
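For reference, the grouping-and-counting logic described above (group by month, count only GOOD) can be sketched outside Clojure. A minimal Python illustration; the record fields mirror the question, but the sample dates and values are made up:

```python
from collections import Counter

# Hypothetical records mirroring the XML fields in the question
records = [
    {"QUALITY_STATUS": "GOOD",    "DATE": "03/01/2013"},
    {"QUALITY_STATUS": "POOR",    "DATE": "15/01/2013"},
    {"QUALITY_STATUS": "GOOD",    "DATE": "20/02/2013"},
    {"QUALITY_STATUS": "AVERAGE", "DATE": "21/02/2013"},
    {"QUALITY_STATUS": "GOOD",    "DATE": "28/02/2013"},
]

# The month is the second field of a DD/MM/YYYY date
good_per_month = Counter(
    r["DATE"].split("/")[1]
    for r in records
    if r["QUALITY_STATUS"] == "GOOD"
)

for month in sorted(good_per_month):
    print(month, "GOOD", good_per_month[month])
```

The key point is to filter to GOOD first and then tally by month, rather than grouping by (month, status) pairs and filtering afterwards.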
We have a graph that contains both customer and product vertices. For a given product, we want to find out how many customers who signed up before DATE have purchased this product. My query looks something like
g.V('PRODUCT_GUID') // get product vertex
.out('product-customer') // get all customers who ever bought this product
.has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))) // see if the customer was created after a given date
.count() // count the results
This query is incredibly slow, so I looked at the Neptune profiler and saw something odd. Below is the full profiler output. Ignore the elapsed time in the profiler; this was after many attempts at the same query, so the cache is warm. In the wild, it can take 45 seconds or more.
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V('PRODUCT_GUID').out('product-customer').has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))).count()
Original Traversal
==================
[GraphStep(vertex,[PRODUCT_GUID]), VertexStep(OUT,[product-customer],vertex), HasStep([created_on.gte(Sat Nov 28 00:33:44 UTC 2020)]), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneCountGlobalStep {
JoinGroupNode {
PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586, indexTime=0, joinTime=14, numSearches=1, actualTotalOutput=13424}
PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574, indexTime=10, joinTime=140, numSearches=13424}
}, annotations={path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep], joinStats=true, optimizationTime=0, maxVarId=8, executionTime=165}
}
]
Physical Pipeline
=================
NeptuneCountGlobalStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
Runtime (ms)
============
Query Execution: 164.996
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneCountGlobalStep 1 1 164.919 100.00
>TOTAL - - 164.919 -
Predicates
==========
# of predicates: 131
Results
=======
Count: 1
Output: [22]
Index Operations
================
Query execution:
# of statement index ops: 13425
# of unique statement index ops: 13425
Duplication ratio: 1.0
# of terms materialized: 0
In particular
DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
This line surprises me. The way I'm reading this is that Neptune is ignoring the vertices coming from ".out('product-customer')" to satisfy the ".has('created_on'...)" requirement, and is instead joining on every single customer vertex that has the created_on attribute.
I would have expected that the cardinality is only the number of customers with an edge from the product, not every single customer.
I'm wondering if there's a way to only run this comparison on the customers coming from the "out('product-customer')" step.
Neptune actually must solve the first pattern,
(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6)
before it can solve the second,
(?3, <created_on>, ?7, ?)
Each quad pattern is an indexed lookup bound by at least two fields. So the first lookup uses the SPOG index in Neptune bound by the Subject (the ID) and the Predicate (the edge label). This will return a set of Objects (the vertex IDs for the vertices at the other end of the product-customer edges) and references them via the ?3 variable for the next pattern.
In the next pattern those vertex IDs (?3) are bound with the Predicate (the property key created_on) to evaluate the condition of the date range. Because this is a conditional evaluation, each vertex in the set of ?3 has to be evaluated (each created_on property on each of those vertices has to be read).
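The two-step evaluation described above can be mimicked with plain dictionaries standing in for Neptune's indexes; all names and data below are invented for illustration:

```python
# Toy stand-in for the S+P-bound SPOG lookup: (subject, predicate) -> objects
edges = {("PRODUCT_GUID", "product-customer"): ["c1", "c2", "c3"]}
# Toy stand-in for the property index: vertex -> created_on value
created_on = {"c1": "2020-11-27", "c2": "2020-11-29", "c3": "2020-12-01"}

# Pattern 1: bound by Subject (the ID) and Predicate (the edge label),
# returns the candidate customer vertices (?3)
customers = edges[("PRODUCT_GUID", "product-customer")]

# Pattern 2: each candidate needs its own created_on read before the
# filter can be applied -- one index probe per customer
lookups = 0
matched = 0
for c in customers:
    lookups += 1
    if created_on[c] >= "2020-11-28":  # ISO strings compare in date order
        matched += 1

print(lookups, matched)
```

This is why numSearches in the profiler (13424) matches the actual output of the first pattern rather than the filtered count: the comparison runs once per candidate vertex.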
I have a text file which contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement) and I would like to extract a specific piece of information from each article: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
    print(result)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
I rewrote your loop like this and merged it with the csv write since you requested it:
import csv

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
get the index of the article using enumerate
sum the number of victims (in case more than one group matches) using a generator expression to convert to integer and pass it to sum; that happens only if something matched (the ternary expression checks that)
create the row
write it as a row (one row per iteration) of the csv.writer object.
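Put end to end with a stub string standing in for the real file (the stub articles and counts below are invented), the pipeline reads:

```python
import csv
import io
import re

# Stub standing in for the contents of News_cleaned_definitive.csv
text_read = ("<p>Advertisement , nothing relevant here "
             "<p>Advertisement , 40 were wounded in the blast "
             "<p>Advertisement , at least 94 were injured")

pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
splitted = text_read.split("<p>")

buf = io.StringIO()  # in-memory target, just for the demo
writer = csv.writer(buf, delimiter=",")
for i, article in enumerate(splitted):
    result = re.findall(pattern, article)
    # result[0] is a tuple of the three groups; only one is non-empty
    nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
    writer.writerow(["article_{}".format(i + 1), nb_casualties])

print(buf.getvalue())
```

Swapping io.StringIO for open("wounded.csv", "w", newline="") writes the same rows to disk. Note that str.split("<p>") yields an empty first chunk before the first tag, which harmlessly becomes an article with count 0.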
Following my earlier question, I have tried to write code that returns a sentence whenever a search term from a certain list appears in it, as follows.
import re
from nltk import tokenize
from nltk.tokenize import sent_tokenize
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP']
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\
Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\
at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \
the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \
estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \
conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\
Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \
examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \
(half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \
(95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users."
    words = txt
    corpus = " ".join(words).lower()
    sentences1 = sent_tokenize(corpus)
    a = [" ".join([sentences1[i-1], j]) for i, j in enumerate(sentences1) if [item in List1] in word_tokenize(j)]
    for i in a:
        print i, '\n', '\n'

foo()
The problem is that the Python IDLE does not print anything. What could I have done wrong? When I run the code all I get is this:
>
>
Your question isn't very clear to me, so please correct me if I'm getting this wrong. Are you trying to match the list of keywords (in List1) against the text (in txt)? That is:
For each keyword in List1
Do a match against every sentence in txt.
Print the sentence if it matches?
Instead of writing a complicated regular expression to solve your problem, I have broken it down into 2 parts.
First I break the whole lot of text into a list of sentences. Then I write a simple regular expression to go through every sentence. The trouble with this approach is that it is not very efficient, but hey, it solves your problem.
Hope this small chunk of code can help guide you to the real solution.
import re

def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP']
    txt = "blah blah blah - truncated"
    matches = []
    sentences = re.split(r'\.', txt)
    keyword = List1[0]
    pattern = re.compile(keyword)
    for sentence in sentences:
        if pattern.search(sentence):
            matches.append(sentence)
    print("Sentence matching the word (" + keyword + "):")
    for match in matches:
        print(match)
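The snippet above only searches for List1[0]; to cover every keyword, the same search can run inside one more loop. A sketch, with a shortened stand-in text and case-insensitive matching (both my assumptions):

```python
import re

List1 = ['risk', 'cancer', 'ocp', 'hormone', 'OCP']
# Shortened stand-in for the long txt in the question
txt = ("Risk factors for breast cancer have been well characterized. "
       "The use of hormone replacement therapy has been confirmed as a risk factor. "
       "There is no evidence that oral contraceptive (OCP) use increases risk.")

sentences = re.split(r'\.', txt)
matches = {}  # keyword -> sentences containing it
for keyword in set(w.lower() for w in List1):  # dedupe 'ocp'/'OCP'
    hits = [s.strip() for s in sentences
            if re.search(keyword, s, re.IGNORECASE)]
    if hits:
        matches[keyword] = hits

for keyword, hits in sorted(matches.items()):
    print(keyword, "->", len(hits), "sentence(s)")
```

Lower-casing the keyword list first avoids searching twice for 'ocp' and 'OCP'.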
--------- Generate random number -----
from random import randint

List1 = ['risk','cancer','ocp','hormone','OCP']
print(randint(0, len(List1) - 1))  # gives you a random index - use the index to access List1
I have a frequency table of words which looks like the one below
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add their frequencies.
In the above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist) but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar phrases of two words
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following
stringdist2 <- function(word1, word2){
min(stringdist(word1, word2),
stringdist(word1, gsub(word2,
pattern = "(.*) (.*)",
repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not understand everything about how pattern and repl work.)
So I need help to create the new frequency table.
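For what it's worth, the grouping step itself (pool the counts of words within edit distance < 3) can be sketched outside R; here is a Python illustration with a hand-rolled Levenshtein distance standing in for stringdist, using the frequencies from the question:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

# Frequencies from the question
freq = {"employees": 1879, "work": 1804, "bose": 1405,
        "people": 971, "company": 959, "employee": 100}

# Greedy merge: attach each word to the first cluster within distance < 3
clusters = []
for word in sorted(freq, key=freq.get, reverse=True):
    for cluster in clusters:
        if any(levenshtein(word, w) < 3 for w in cluster["words"]):
            cluster["words"].append(word)
            cluster["count"] += freq[word]
            break
    else:
        clusters.append({"words": [word], "count": freq[word]})

for c in clusters:
    print(",".join(sorted(c["words"])), c["count"])
```

Here 'employee' (distance 1 from 'employees') merges into a single cluster with a pooled count of 1979, while the remaining words stay separate. Note that greedy merging is order-dependent; processing words by descending frequency keeps the most common spelling as the cluster seed.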
This question is in the context of twoway line with the by() option, but I think the bigger problem is how to identify a second (and all subsequent) event windows without a priori knowing every event window.
Below I generate some data with five countries over the 1990s and 2000s. In all countries an event occurs in 1995 and in Canada only the event repeats in 2005. I would like to plot outcome over the five years centered on each event in each country. If I do this using twoway line and by(), then Canada plots twice in the same plot window.
clear
set obs 100
generate year = 1990 + mod(_n, 20)
generate country = "United Kingdom" in 1/20
replace country = "United States" in 21/40
replace country = "Canada" in 41/60
replace country = "Australia" in 61/80
replace country = "New Zealand" in 81/100
generate event = (year == 1995) ///
| ((year == 2005) & (country == "Canada"))
generate time_to_event = 0 if (event == 1)
generate outcome = runiform()
encode country, generate(countryn)
xtset countryn year
forvalue i = 1/2 {
replace time_to_event = `i' if (l`i'.event == 1)
replace time_to_event = -`i' if (f`i'.event == 1)
}
twoway line outcome time_to_event, ///
by(country) name(orig, replace)
A manual solution adds an occurrence variable that numbers each event occurrence by country, then adds occurrence to the by() option.
generate occurrence = 1 if !missing(time_to_event)
replace occurrence = 2 if ///
(inrange(year, 2005 - 2, 2005 + 2) & (country == "Canada"))
twoway line outcome time_to_event, ///
by(country occurrence) name(attempt, replace)
This works great in the play data, but in my real data there are many more countries and many more events. I can manually code this occurrence variable, but that is tedious (and now I'm really curious whether there's a tool or logic that works :) ).
Is there a logic to automate identifying windows? Or one that at least works with twoway line? Thanks!
You have generated a variable time_to_event which is -2 .. 2 in a window and missing otherwise. You can use tsspell from SSC, installed by
ssc inst tsspell
to label such windows. Windows are defined by spells or runs of observations that are all non-missing on time_to_event:
tsspell, cond(time_to_event < .)
tsspell requires a prior tsset and generates three variables explained in its help. You can then renumber windows using one of those variables, _seq (the sequence number within each spell, numbered from 1 up):
gen _spell2 = (_seq > 0) * sum(_seq == 1)
and then label spells distinctly using country and the per-spell identifier _spell, another variable produced by tsspell:
egen gspell = group(country _spell) if _spell2, label
My code assumes that windows are disjoint and cannot overlap, but that seems to be one of your assumptions too. Some technique for handling spells is given by http://www.stata-journal.com/sjpdf.html?articlenum=dm0029 That article does not mention tsspell, which in essence is an implementation of its principles. I started explaining the principles, but the article got long enough before I could explain the program. As the help for tsspell is quite detailed, I doubt that a sequel paper is needed, or at least that it will be written.
(LATER) This code also assumes that windows don't touch. Solving that problem suggests a more direct approach not involving tsspell at all:
bysort country (year) : gen w_id = (time_to_event < .) * sum(time_to_event == -2)
egen w_label = group(country w_id) if w_id, label
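The cumulative-sum trick behind both one-liners translates directly to other languages; a Python sketch over a made-up time_to_event series for one country (None standing in for Stata's missing):

```python
# time_to_event within one country, sorted by year; None = missing.
# Two five-year windows, as in the Canada example.
tte = [None, None, -2, -1, 0, 1, 2, None, -2, -1, 0, 1, 2, None]

# w_id = (non-missing) * running count of window starts (time_to_event == -2),
# mirroring: gen w_id = (time_to_event < .) * sum(time_to_event == -2)
w_id = []
starts = 0
for t in tte:
    starts += (t == -2)
    w_id.append((t is not None) * starts)

print(w_id)
```

Observations outside any window get 0, the first window gets 1, the second gets 2, and so on; grouping on (country, w_id) then labels each occurrence, which is exactly what the egen group() call does.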