Build data using dictionary - python-2.7

I am new to Python and I am still learning how it all works; it's been just a week since I started.
I am trying to write a program which does this:
Reads 4 columns from a file (see the input file below)
Gets the date, day, and count from the file
Constructs a dictionary to represent date, day, and count
Basically I want to represent the data in something like the structure below, and I am stuck on the syntax.
{
    "xyz": {
        "Sunday": {
            "20180101": 72326,
            "20180108": 71120
        },
        "Monday": {
            "20171225": 51954,
            "20180102": 51954
        }
    }
}
INPUT FILE:
DateDay value count floatex
20171225Monday | 270613| 51954|11.41|
20171226Tuesday | 133579| 46126|12.01|
20171227Wednesday| 630613| 71954|11.41|
20171228Thursday | 253779| 96126|12.01|
20171229Friday | 688613| 71054|11.41|
20171230Saturday | 633779| 66126|12.01|
20180101Sunday | 633779| 72326|12.01|
20180102Monday | 630613| 91954|11.41|
20180103Tuesday | 538779| 73326|12.01|
20180104Wednesday| 630613| 61954|11.41|
20180105Thursday | 393379| 75146|12.01|
20180106Friday | 130613| 51954|11.41|
20180107Saturday | 2643329| 70126|12.01|
20180108Sunday | 863979| 71120|12.01|
This is what I have, but it's far from what I want. In fact, it's throwing an error now, but that is not my question. Basically, I am trying to understand how to create the nested dictionary based on the input data.
def buildInputDataDictionary(file, ind):
    dateCount = {}
    dateDay = {}
    # dictData = {}
    # dateCount[dictData] = {}
    with open(file) as f:
        for line in f:
            items = line.split("|")
            date = items[0].strip()[0:8]  # strip spaces and substring to get only the date
            count = items[2].strip()
            day = items[0].strip()[8:]
            dateCount[date] = count
            dateDay[date] = day
            dictData = {}
            dictData[date] = {}
            dictData[ind][date] = count
    return dateCount, dateDay, dictData

dc, dd, di = buildInputDataDictionary(autoInqRhf, "xyz")
print dd
print dc
print di

In your current script you reset the dictionary on every iteration of the for loop; you only need to initialise it once, outside the loop. When adding nested data you also have to make sure that all keys above the one you want to set already exist. You can do that like this:
# initiate the dictionary with your identifier as the first key
dictData = {ind: dict()}
with open(file) as f:
    for line in f:
        if "|" not in line:
            continue  # skip the header line
        # extract your data (haven't tested your code)
        items = line.split("|")
        day = items[0].strip()[8:]
        date = items[0].strip()[0:8]
        count = items[2].strip()
        # add days
        if day not in dictData[ind]:
            dictData[ind][day] = dict()
        dictData[ind][day][date] = count
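For completeness, here is an equivalent sketch (not in the original answer) using collections.defaultdict, which creates the missing inner dictionaries for you. The file name is a placeholder, and the count is read from the third |-separated column as in the question:

from collections import defaultdict

def build_input_data_dictionary(path, ind):
    # identifier -> day -> date -> count
    dict_data = {ind: defaultdict(dict)}
    with open(path) as f:
        for line in f:
            if "|" not in line:
                continue  # skip the header line
            items = line.split("|")
            date = items[0].strip()[0:8]
            day = items[0].strip()[8:]
            count = int(items[2].strip())
            dict_data[ind][day][date] = count
    return dict_data

di = build_input_data_dictionary("input.txt", "xyz")
print di["xyz"]["Monday"]  # e.g. {'20171225': 51954, '20180102': 91954}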

Related

Convert list to dataframe and then join with different dataframe in pyspark

I am working with pyspark dataframes.
I have a list of date type values:
date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
Also I have a dataframe (mean_df) that has only one column (mean).
+----+
|mean|
+----+
|67 |
|78 |
|98 |
+----+
Now I want to convert date_list into a column and join with mean_df:
expected output:
+------------+----+
|dates |mean|
+------------+----+
|2018-01-19 | 67|
|2018-01-20 | 78|
|2018-01-17 | 98|
+------------+----+
I tried converting the list to a dataframe (date_df):
date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])
and then used monotonically_increasing_id() to add a new column "idx" to both date_df and mean_df and joined on it:
date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")
I got a timeout-exceeded error, so I changed the default broadcastTimeout from 300s to 6000s:
spark.conf.set("spark.sql.broadcastTimeout", 6000)
But it did not work at all. I am also working with a really small sample of data right now; the actual data is quite large.
Snippet of code:
date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))
    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
You can use withColumn and lit to add the date to the dataframe:
import pyspark.sql.functions as F

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    mean1 = h2_df1.select(F.mean(F.col('count_before')).alias('mean_before')).withColumn('date', F.lit(d))
    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
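If you do want to join a list to an existing dataframe by position instead, a rough sketch of the index-join approach (assuming the spark session and mean_df from the question, equal row counts on both sides, and that the current row order is the pairing you want) would be:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
date_df = spark.createDataFrame([(d,) for d in date_list], ['dates'])

# row_number needs an ordering; ordering by monotonically_increasing_id keeps the
# current order but moves everything to one partition, so only do this for small data
w = Window.orderBy(F.monotonically_increasing_id())
date_df = date_df.withColumn('idx', F.row_number().over(w))
mean_df_indexed = mean_df.withColumn('idx', F.row_number().over(w))

result = date_df.join(mean_df_indexed, on='idx').drop('idx')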

In the local environment, the result value and the dataflow result values are different

Here is my input data.
ㅡ.Input(Local)
'Iot,c c++ python,2015',
'Web,java spring,2016',
'Iot,c c++ spring,2017',
'Iot,c c++ spring,2017',
This is the result of running apache-beam in a local environment.
ㅡ.Output(Local)
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,2
Iot,2017,c++,2
Iot,2017,spring,2
Web,2016,java,1
Web,2016,spring,1
However, when I run it on Google Cloud Dataflow and write the output to a bucket, the results are different.
ㅡ. Storage(Bucket)
Web,2016,java,1
Web,2016,spring,1
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Here is my code.
ㅡ. Code
# apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

pipeline_options = PipelineOptions(
    project='project-id',
    runner='dataflow',
    temp_location='bucket-location'
)

def pardo_dofn_methods(test=None):
    import apache_beam as beam

    class split_category_advanced(beam.DoFn):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
            self.k = 1
            self.pre_processing = []
            self.window = beam.window.GlobalWindow()
            self.year_dict = {}
            self.category_index = 0
            self.language_index = 1
            self.year_index = 2
            self.result = []

        def setup(self):
            print('setup')

        def start_bundle(self):
            print('start_bundle')

        def finish_bundle(self):
            print('finish_bundle')
            for ppc_index in range(len(self.pre_processing)):
                if self.category_index == 0 or self.category_index % 3 == 0:
                    if self.pre_processing[self.category_index] not in self.year_dict:
                        self.year_dict[self.pre_processing[self.category_index]] = {}
                if ppc_index + 2 == 2 or ppc_index + 2 == self.year_index:
                    # { category : { year : {} } }
                    if self.pre_processing[self.year_index] not in self.year_dict[self.pre_processing[self.category_index]]:
                        self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]] = {}
                    # { category : { year : { c : {}, c++ : {}, java : {} } } }
                    language = self.pre_processing[self.year_index - 1].split(' ')
                    for lang_index in range(len(language)):
                        if language[lang_index] not in self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]]:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] = 1
                        else:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] += 1
                    self.year_index = self.year_index + 3
                self.category_index = self.category_index + 1
            csvFormat = ''
            for category, nested in self.year_dict.items():
                for year in nested:
                    for language in nested[year]:
                        csvFormat += (category + "," + str(year) + "," + language + "," + str(nested[year][language])) + "\n"
            print(csvFormat)
            yield beam.utils.windowed_value.WindowedValue(
                value=csvFormat,
                # value=self.pre_processing,
                timestamp=0,
                windows=[self.window],
            )

        def process(self, text):
            for word in text.split(self.delimiter):
                self.pre_processing.append(word)
            print(self.pre_processing)

    # with beam.Pipeline(options=pipeline_options) as pipeline:
    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | 'Gardening plants' >> beam.Create([
                'Iot,c c++ python,2015',
                'Web,java spring,2016',
                'Iot,c c++ spring,2017',
                'Iot,c c++ spring,2017',
            ])
            | 'Split category advanced' >> beam.ParDo(split_category_advanced(','))
            | 'Save' >> beam.io.textio.WriteToText("bucket-location")
            | beam.Map(print)
        )
    if test:
        return test(results)

if __name__ == '__main__':
    pardo_dofn_methods()
The code performs a simple word count.
The CSV columns are [category, year, language, count], e.g. Iot, 2015, c, 1.
Thank you for reading.
The most likely reason you are getting different output is parallelism. When using DataflowRunner, operations run in parallel as much as possible. Since you are counting inside a ParDo, when the two Iot,c c++ spring,2017 elements go to different workers, the count doesn't happen the way you want (you are counting in the ParDo).
You need to use Combiners (4.2.4)
Here you have an easy example of what you want to do:
import apache_beam as beam
from apache_beam import Create, Map

def generate_kvs(element, csv_delimiter=',', field_delimiter=' '):
    splitted = element.split(csv_delimiter)
    fields = splitted[1].split(field_delimiter)
    # final key to count is (Source, year, language)
    return [(f"{splitted[0]}, {splitted[2]}, {x}", 1) for x in fields]

p = beam.Pipeline()

elements = ['Iot,c c++ python,2015',
            'Web,java spring,2016',
            'Iot,c c++ spring,2017',
            'Iot,c c++ spring,2017']

(p | Create(elements)
   | beam.ParDo(generate_kvs)
   | beam.combiners.Count.PerKey()
   | "Format" >> Map(lambda x: f"{x[0]}, {x[1]}")
   | Map(print))

p.run()
This will output the result you want no matter how the elements are distributed across workers.
Note that the idea of Apache Beam is to parallelise as much as possible, and in order to aggregate you need Combiners.
I would recommend checking some wordcount examples to get the hang of combiners.
EDIT
Clarification on Combiners:
A ParDo is an operation that happens on an element-by-element basis: it takes one element, performs some operations, and sends the output to the next PTransform. When you need to aggregate data (count elements, sum values, join sentences...), element-wise operations don't work; you need something that takes a PCollection (i.e., many elements with a logic) and outputs something. This is where combiners come in: they perform operations on a per-PCollection basis, which can be spread across workers (part of the Map/Reduce model).
In your example, you were using a class attribute to store the count inside the ParDo, so when an element went through it, it changed the attribute within the class. This works while all elements go through the same worker, since the class is "created" per worker (i.e., workers don't share state), but when there are more workers, the count (with the ParDo) happens separately in each worker.
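As a minimal illustration (not from the original answer, with placeholder key/value pairs), counting with a combiner instead of DoFn state looks like this:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('Iot,2017,spring', 1),
                    ('Iot,2017,spring', 1),
                    ('Web,2016,java', 1)])
     # CombinePerKey merges partial counts across bundles and workers,
     # unlike a counter kept as state on a DoFn instance
     | beam.CombinePerKey(sum)
     | beam.Map(print))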

How to save string as json in scala spark

I have raw strings in a log file. I do many filters and other operations after that, and I have reached the following problem: I need to convert the string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a dataframe that contains datetime | location | jsonofinstruction.
The JSON is built from the third val. I tried splitting the string first by comma, then by the equals sign, looping through in steps of 2, and creating a map with the first part as the key and the second as the value, but the JSON is not created. Please help.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
import scala.util.parsing.json.JSONObject
import spark.implicits._ // needed for map/toDF encoders outside the spark-shell

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")

val dfWithJson = df.map { row =>
  val insMap = row.getAs[String]("col3").split(",").map { kv =>
    val kvArray = kv.split("=")
    (kvArray(0), kvArray(1))
  }.toMap
  val insJson = JSONObject(insMap).toString()
  (row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4").show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+

How to do a cross join / cartesian product in RavenDB?

I have a web application that uses RavenDB on the backend and allows the user to keep track of inventory. The three entities in my domain are:
public class Location
{
    string Id
    string Name
}

public class ItemType
{
    string Id
    string Name
}

public class Item
{
    string Id
    DenormalizedRef<Location> Location
    DenormalizedRef<ItemType> ItemType
}
On my web app, there is a page for the user to see a summary breakdown of the inventory they have at the various locations. Specifically, it shows the location name, item type name, and then a count of items.
The first approach I took was a map/reduce index on InventoryItems:
this.Map = inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        ItemTypeName = inventoryItem.ItemType.Name,
        Count = 1
    };

this.Reduce = indexEntries =>
    from indexEntry in indexEntries
    group indexEntry by new
    {
        indexEntry.LocationName,
        indexEntry.ItemTypeName,
    } into g
    select new
    {
        g.Key.LocationName,
        g.Key.ItemTypeName,
        Count = g.Sum(entry => entry.Count),
    };
That is working fine but it only displays rows for Location/ItemType pairs that have a non-zero count of items. I need to have it show all Locations and for each location, all item types even those that don't have any items associated with them.
I've tried a few different approaches but no success so far. My thought was to turn the above into a Multi-Map/Reduce index and just add another map that would give me the cartesian product of Locations and ItemTypes but with a Count of 0. Then I could feed that into the reduce and would always have a record for every location/itemtype pair.
this.AddMap<object>(docs =>
    from itemType in docs.WhereEntityIs<ItemType>("ItemTypes")
    from location in docs.WhereEntityIs<Location>("Locations")
    select new
    {
        LocationName = location.Name,
        ItemTypeName = itemType.Name,
        Count = 0
    });
This isn't working, though, so I'm thinking RavenDB doesn't like this kind of mapping. Is there a way to get a cross join / cartesian product from RavenDB? Alternatively, is there any other way to accomplish what I'm trying to do?
EDIT: To clarify, Locations, ItemTypes, and Items are documents in the system that the user of the app creates. Without any Items in the system, if the user enters three Locations "London", "Paris", and "Berlin" along with two ItemTypes "Desktop" and "Laptop", the expected result is that when they look at the inventory summary, they see a table like so:
| Location | Item Type | Count |
|----------|-----------|-------|
| London | Desktop | 0 |
| London | Laptop | 0 |
| Paris | Desktop | 0 |
| Paris | Laptop | 0 |
| Berlin | Desktop | 0 |
| Berlin | Laptop | 0 |
Here is how you can do this with all the empty locations as well:
this.AddMap<InventoryItem>(inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        Items = new[]
        {
            new { ItemTypeName = inventoryItem.ItemType.Name, Count = 1 }
        }
    });

this.AddMap<Location>(locations =>
    from location in locations
    select new
    {
        LocationName = location.Name,
        Items = new object[0]
    });

this.Reduce = results =>
    from result in results
    group result by result.LocationName into g
    select new
    {
        LocationName = g.Key,
        Items = from item in g.SelectMany(x => x.Items)
                group item by item.ItemTypeName into gi
                select new
                {
                    ItemTypeName = gi.Key,
                    Count = gi.Sum(x => x.Count)
                }
    };

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve the values as concisely as possible so that I get a map of the form Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on a MySQL slave server, and I am building an alert system over it. So, from this entire paragraph in the form of a String, I need to separate out the queries and save the corresponding time range with them.
1.001303 to 31.856310 is a time range, and against that time range the corresponding query is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala, a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the the repeating cells to do my manipulations on it later. But it doesnt seem to be working;
THANKS IN ADvance! I know there are several ways to do this but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method ?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains the range
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains the range
          (None, map + (el -> range)) // add to map
        case None =>
          regex.findFirstIn(el) match { // extract the range
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map) // probably an empty line
      }
  }
  map
}
Modified ajozwik's answer to work for SQL commands that span multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range

def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (accumulator, element) =>
      val (taking, map) = accumulator
      taking match {
        case Some(range) if element.trim.nonEmpty =>
          if (element.contains("Queries"))
            (None, map)
          else
            (Some(range), map + (range -> (map.getOrElse(range, "") + element)))
        case None =>
          regex.findFirstIn(element) match {
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map)
      }
  }
  println(map)
  map
}