My code, shown below, consists of a series of transformations:
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, trim, regexp_replace
from pyspark.sql.window import Window

dictionaryDf = spark.read.option("header", "true").csv(
    "s3://...../.csv")
# cache() on a DataFrame already persists with MEMORY_AND_DISK, so a single persist call is enough
web_notif_data = fullLoad.persist(StorageLevel.MEMORY_AND_DISK)
print("::::::data has been loaded::::::::::::")
distinct_campaign_name = web_notif_data.select(
trim(web_notif_data.campaign_name).alias("campaign_name")).distinct()
web_notif_data.createOrReplaceTempView("temp")
variablesList = Config.get('web', 'variablesListWeb')
web_notif_data = spark.sql(variablesList)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
web_notif_data = web_notif_data.withColumn("camp", regexp_replace("campaign_name", "_", ""))
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("camp", "campaign_name")
web_notif_data = web_notif_data.withColumn("channel", lit("web_notification"))
web_notif_data.createOrReplaceTempView("data")
campaignTeamWeb = Config.get('web', 'campaignTeamWeb')
web_notif_data = spark.sql(campaignTeamWeb)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
distinct_campaign_name = distinct_campaign_name.withColumn("camp", F.regexp_replace(
F.lower(F.trim(col("campaign_name"))),
"[^a-zA-Z0-9]", ""))
output_df3 = (
distinct_campaign_name.withColumn("cname_split",
F.explode(F.split(F.lower(F.trim(col("campaign_name"))), "_")))
.join(
dictionaryDf,
(
(
(F.col("function") == "contains") &
F.col("camp").contains(F.col("terms"))
) |
(
(F.col("function") == "match") &
F.col("campaign_name").contains("_") &
(F.col("cname_split") == F.col("terms"))
)
),
"left"
)
.withColumn(
"empty_is_other",
F.when(
(
F.col("product").isNull() &
F.col("product_category").isNull()
),
"other"
)
)
.withColumn(
"rn",
F.row_number().over(
Window.partitionBy("campaign_name")
.orderBy(
F.when(
F.col("function").isNull(), 3
).when(
F.col("function") == "match", 2
).otherwise(1),
F.length(F.col("terms")).desc(),
F.col("product").isNull()
)
)
)
.filter("rn=1")
.select(
"campaign_name",
F.coalesce("product", "empty_is_other").alias("prod"),
F.coalesce("product_category", "empty_is_other").alias("prod_cat"),
)
.na.fill("")
)
print(":::::::::::transformations have been done finally::::::::::::")
web_notif_data1 = web_notif_data # Just taking the backup of DF in case something goes wrong
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("temp_campaign_name", "campaign_name")
veryFinalDF = web_notif_data.join(output_df3, "campaign_name", "left_outer")
# veryFinalDF.show(truncate=False)
veryFinalDF.write.mode("overwrite").parquet(aggregatedPath)
print("::::final data have been written successfully::::::")
where fullLoad is the DataFrame that reads from the Redshift table. This code works fine on 0.2 million records. However, in production the data for 15 days could be a minimum of around 25 million records. I don't know the exact size, since the data is stored in the Redshift table and we read from it and then process it. I am running this code via Glue jobs, and it gets stuck on the last line, i.e. while writing the data as parquet. It gives me the below error:
I tried running it with 30 executors. It takes around 20 minutes to load the data from Redshift into the fullLoad DataFrame. What else can be done to avoid this error? I am new to AWS and Glue jobs.
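For context, a common first lever with a large parquet write like this is to bound the size of each output task by repartitioning just before the write, since a few oversized or skewed partitions can stall or OOM the final stage. A minimal, hedged sketch (same veryFinalDF and aggregatedPath as above; the partition count of 200 is only an assumed starting point to tune against data volume and executor count):

# Hedged sketch, not a verified fix: bound per-task output size so no
# single executor has to materialize a huge partition during the write.
final_df = veryFinalDF.repartition(200)  # tune 200 to data size / cluster
final_df.write.mode("overwrite").parquet(aggregatedPath)

It is also worth checking whether the join key campaign_name is heavily skewed, since skew concentrates most rows in a few tasks regardless of the partition count.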
I'm trying to relabel a bunch of Power BI reports and I am doing that with this script.
(Source) =>
let
res = List.Accumulate(
Text.ToList(Source),
[result="", index=0, source=Source],
fnAccumulator
),
IsUpper = (txt) => txt <> "" and txt = Text.Upper(txt) and txt <> Text.Lower(txt),
fnAccumulator =
(state as record, current as text) as record =>
let
prevCharacter =
if state[index]=0 then
""
else
Text.At(state[source], state[index] - 1),
prevCharacter2 =
if state[index]<=1 then
""
else
Text.At(state[source], state[index] - 2),
nextCharacter =
if state[index] = Text.Length(state[source]) - 1 then
""
else
Text.At(state[source], state[index] + 1),
aggregatedResult =
if state[index]=0 or current = "" then
current
else
if IsUpper(current) and
(not IsUpper(prevCharacter) or
(IsUpper(prevCharacter2) and not IsUpper(nextCharacter))) then
state[result] & " " & current
else
state[result] & current,
resultRecord =
if aggregatedResult = null then
null
else
[result = aggregatedResult, index = state[index]+1, source = state[source]]
in
resultRecord
in
res[result]
The problem is that after running this on every table and applying the changes, all the relationships in the report end up nuked.
(Before/after screenshots of the model relationships were attached in the original post.)
My thought process is that my script isn't taking SIDs into account and is renaming them when they should stay the same, and this is what breaks the relationships. I would appreciate either an alternative way of mass-renaming these columns, or a way to add an exclusion to my script so that it ignores any and all SIDs.
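For what it's worth, the word-splitting rule the script implements (insert a space before an uppercase letter that starts a new word, keeping acronym runs together) is compact to express as a regex. A purely illustrative sketch in Python, since the rename itself still has to happen in M; the "SID" check is a hypothetical convention for the exclusion asked about:

import re

# Insert a space before each new word, keeping acronyms like "PBI" intact.
def split_camel(name):
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)

def rename(name):
    # Hypothetical exclusion: leave any column containing "SID" untouched.
    return name if "SID" in name else split_camel(name)

print(rename("CustomerAccountName"))  # -> "Customer Account Name"
print(rename("CustomerSID"))          # -> "CustomerSID" (unchanged)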
I have the data below
create table #data (Post_Code varchar(10), Internal_Code varchar(10))
insert into #data values
('AB10','hb3'),('AB10','hb25'),('AB12','dd1'),('AB15','hb6'),('AB16','aa4'),('AB16','hb7'),
('AB16','aa2'),('AB16','ab9'),('AB18','rr6'),('AB18','rr9'),('AB18','hb10'),('AB20','rr15'),
('AB20','td2'),('AB21','hb8'),('AB21','cc4'),('AB21','cc4'),('AB24','td5'),('AB9','yy3'),
('RM2','CC1'),('RM6','hb6'),('RM7','cc2'),('SA24','rr1'),('SA24','hb5'),('SA24','rr2'),
('SA24','cc34'),('SE15','rr9'),('SE15','rr5'),('SE25','rr10'),('SE25','hb11'),('SE25','rr8'),
('SE25','rr1'),('LA15','rr2')
select * from #data
drop table #data
What I want to achieve: if the same post code has both an "hb" and an "rr" Internal_Code, return 1, else 0.
The "hb" and "rr" Internal_Codes must be under the same Post_Code; if they are under different post codes, the result should be 0.
I wrote this DAX
Result = IF(left(Data[Internal_Code],2)="hb" || left(Data[Internal_Code],2)="rr",1,0)
it is not returning the correct result
(Screenshots of the current output and the expected output were attached in the original post.)
I think your expected result is incorrect, as SA24 should also be 1. You should ideally do a calculation like this in Power Query, but if you need to do it in DAX as a calculated column, the following code works.
Result =
VAR post_code = Data[Post_Code]
RETURN
    VAR hb = CALCULATE ( COUNTROWS ( Data ), Data[Post_Code] = post_code && LEFT ( Data[Internal_Code], 2 ) = "hb" )
    VAR rr = CALCULATE ( COUNTROWS ( Data ), Data[Post_Code] = post_code && LEFT ( Data[Internal_Code], 2 ) = "rr" )
    RETURN IF ( hb > 0 && rr > 0, 1, 0 )
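If you want to sanity-check that rule outside Power BI, here is a quick pandas sketch of the same logic (a hypothetical harness, using a few of the sample rows from the question):

import pandas as pd

# A few sample rows from the question's #data table.
data = pd.DataFrame(
    {"Post_Code": ["AB10", "AB10", "AB18", "AB18", "SA24", "SA24", "LA15"],
     "Internal_Code": ["hb3", "hb25", "rr6", "hb10", "rr1", "hb5", "rr2"]}
)

prefix = data["Internal_Code"].str.lower().str[:2]
grouped = pd.DataFrame(
    {"Post_Code": data["Post_Code"], "hb": prefix.eq("hb"), "rr": prefix.eq("rr")}
).groupby("Post_Code").any()

# 1 when a post code has both an "hb" and an "rr" code, else 0.
data["Result"] = data["Post_Code"].map((grouped["hb"] & grouped["rr"]).astype(int))
print(data)  # AB10 -> 0, AB18 -> 1, SA24 -> 1, LA15 -> 0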
I am currently stuck on the below issue:
I have two tables that I have to work with: one contains financial information for vessels, and the other contains arrival and departure times for vessels. I get my data by combining multiple Excel sheets from different folders:
financialTable and voyageTimeTable (both shown as screenshots in the original post)
I have to calculate the result for the above voyage, and apportion the result over June, July and August for both estimated and updated.
Time in June: 4 hours (20/06/2020 20:00 - 23:59) + 10 days (21/06/2020 00:00 - 30/06/2020 23:59) = 10.1666 days
Time in July: 31 full days
Time in August: 1 day + 14 hours (02/08/2020 00:00 - 14:00) = 1.5833 days
Total voyage duration = 10.1666 + 31 + 1.5833 = 42.7499 days
The result for the "updated" financialItem would be the following:
Result June : 100*(10.1666/42.7499) = 23.7816
Result July : 100*(31/42.7499) = 72.5148
Result August : 100*(1.5833/42.7499) = 3.7036
sum = 100
and then for "estimated" everything above would be doubled.
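The same arithmetic, sketched in Python purely as a sanity check of the proration rule (voyage dates from the example; the tiny differences from the figures above are rounding):

from datetime import datetime

start = datetime(2020, 6, 20, 20, 0)  # voyage start
end = datetime(2020, 8, 2, 14, 0)     # voyage end
total_days = (end - start).total_seconds() / 86400  # ~42.75

def overlap_days(year, month):
    # Days of the voyage that fall inside the given calendar month.
    month_start = datetime(year, month, 1)
    month_end = datetime(year + month // 12, month % 12 + 1, 1)
    lo, hi = max(start, month_start), min(end, month_end)
    return max((hi - lo).total_seconds() / 86400, 0.0)

for year, month in [(2020, 6), (2020, 7), (2020, 8)]:
    share = overlap_days(year, month) / total_days
    print(year, month, round(100 * share, 4))  # "updated" = 100; double for "estimated"
# -> 23.7817, 72.5146 and 3.7037, summing to 100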
This is the format I would ideally like to get (prorataResultTable, shown as a screenshot in the original post):
I have to do this for multiple vessels, with multiple timespans and several voyage numbers.
Eagerly awaiting responses, if any. Many thanks in advance.
Brds,
Not sure if you're still looking for an answer, but code below gives me your expected output:
let
financialTable = Table.FromRows({{"A", 1, "profit/loss", 200, 100}}, type table [vesselName = text, vesselNumber = Int64.Type, financialItem = text, estimated = number, updated = number]),
voyageTimeTable = Table.FromRows({{"A", 1, #datetime(2020, 6, 20, 20, 0, 0), #datetime(2020, 8, 2, 14, 0, 0)}}, type table [vesselName = text, vesselNumber = Int64.Type, voyageStartDatetime = datetime, voyageEndDatetime = datetime]),
joined =
let
joined = Table.NestedJoin(financialTable, {"vesselName", "vesselNumber"}, voyageTimeTable, {"vesselName", "vesselNumber"}, "$toExpand", JoinKind.LeftOuter),
expanded = Table.ExpandTableColumn(joined, "$toExpand", {"voyageStartDatetime", "voyageEndDatetime"})
in expanded,
toExpand = Table.AddColumn(joined, "$toExpand", (currentRow as record) =>
let
voyageInclusiveStart = DateTime.From(currentRow[voyageStartDatetime]),
voyageExclusiveEnd = DateTime.From(currentRow[voyageEndDatetime]),
voyageDurationInDays = Duration.TotalDays(voyageExclusiveEnd - voyageInclusiveStart),
createRecordForPeriod = (someInclusiveStart as datetime) => [
inclusiveStart = someInclusiveStart,
exclusiveEnd = List.Min({
DateTime.From(Date.EndOfMonth(DateTime.Date(someInclusiveStart)) + #duration(1, 0, 0, 0)),
voyageExclusiveEnd
}),
durationInDays = Duration.TotalDays(exclusiveEnd - inclusiveStart),
prorataDuration = durationInDays / voyageDurationInDays,
estimated = prorataDuration * currentRow[estimated],
updated = prorataDuration * currentRow[updated],
month = Date.MonthName(DateTime.Date(inclusiveStart)),
year = Date.Year(inclusiveStart)
],
monthlyRecords = List.Generate(
() => createRecordForPeriod(voyageInclusiveStart),
each [inclusiveStart] < voyageExclusiveEnd,
each createRecordForPeriod([exclusiveEnd])
),
toTable = Table.FromRecords(monthlyRecords)
in toTable
),
expanded =
let
dropped = Table.RemoveColumns(toExpand, {"estimated", "updated", "voyageStartDatetime", "voyageEndDatetime"}),
expanded = Table.ExpandTableColumn(dropped, "$toExpand", {"month", "year", "estimated", "updated"})
in expanded
in
expanded
The code tries to:
join financialTable and voyageTimeTable, so that for each vesselName and vesselNumber combination, we know: estimated, updated, voyageStartDatetime and voyageEndDatetime.
generate a list of months for the period between voyageStartDatetime and voyageEndDatetime (which get expanded into new table rows)
for each month (in the list), do all the arithmetic you mention in your question
get rid of some columns (like the old estimated and updated columns)
I recommend testing it with different vesselNames and vesselNumbers from your dataset, just to see if the output is always correct (I think it should be).
You should be able to manually inspect the cells in the $toExpand column (of the toExpand step/expression) to see the nested rows before they get expanded.
I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a String now. I need to retrieve the values as concisely as possible, so that I end up with a map of the form Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on the MySQL slave server, and I am building an alerting system over it. So, from this entire paragraph in String form, I need to separate out the queries and save the corresponding time range against each one.
For example, 1.001303 to 31.856310 is a time range, and against that time range the corresponding query is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala, of the form (query: String -> timeRange: String). Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';" -> "4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the the repeating cells to do my manipulations on it later. But it doesnt seem to be working;
THANKS IN ADvance! I know there are several ways to do this but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method ?
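One way to prototype the pattern before porting it to Scala is a quick Python sketch (the block regex below is an assumption based on the layout shown above, not a known-good answer):

import re

# Each block: "N Queries", blank line, "Taking ... seconds to complete",
# blank line, then the query text (possibly spanning several lines).
block_re = re.compile(
    r"Taking\s+([\d.]+)(?:\s+to\s+|,\s*)([\d.]+)\s+seconds to complete"
    r"\s*\n\s*\n(.+?)(?=\n\s*\n\d+\s+Queries|\Z)",
    re.S,
)

def slow_queries(report):
    # Build {query -> "T1 to T2 seconds"}, flattening multi-line SQL.
    return {" ".join(q.split()): "{} to {} seconds".format(lo, hi)
            for lo, hi, q in block_re.findall(report)}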
If you want to use Scala, try this:
val regex = """(\d+)\.(\d+).*(\d+)\.(\d+) seconds""".r // extract the time range (dots escaped to match literal decimal points)
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(acc,el) =>
val (taking,map) = acc // taking contains range
taking match {
case Some(range) if el.trim.nonEmpty => //Some contains range
(None,map + ( el -> range)) // add to map
case None =>
regex.findFirstIn(el) match { //extract range
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map) // probably empty line
}
}
map
}
Modified ajozwik's answer to work for SQL commands that span multiple lines:
val regex = """(\d+)\.(\d+).*(\d+)\.(\d+) seconds""".r // extract the time range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}
Good afternoon,
I have a problem when displaying the contents of my table in a list: the rows never render their data, and I do not understand why. I have version 1076 of the Corona SDK and it does not work there, though it worked with the previous build. I hope you can help.
local function onRowRender( event )
print("oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo")
local phase = event.phase
local row = event.row
local rowGroup = event.view
local label = aux.corrigeEspeciales (rowTitles[ row.index ])
local color = 20
print ("label" .. label)
row.itemName = label
row.textObj = display.newRetinaText(rowGroup,label, 0, 0, "Verdana", 12 )
row.textObj:setTextColor( color )
row.textObj:setReferencePoint( display.CenterLeftReferencePoint )
row.textObj.x, row.textObj.y = 20, rowGroup.contentHeight * 0.5
rowGroup:insert( row.textObj )
row.arrow = display.newImage( "images/tiendarowArrow.png", false )
row.arrow.x = rowGroup.contentWidth - row.arrow.contentWidth * 2
row.arrow.y = rowGroup.contentHeight * 0.5
rowGroup:insert( row.arrow )
end
You can show the text info in Corona SDK version 2013.1076 using:
local phase = event.phase
local row = event.row
local rowGroup = event.view
local label = aux.corrigeEspeciales (rowTitles[ row.index ])
local color = 20
row.itemName = label
local rowTitle = display.newText(row,label, 0, 0, "Verdana", 12 )
rowTitle.x = row.x - ( row.contentWidth * 0.5 ) + ( rowTitle.contentWidth * 0.5 )
rowTitle.y = row.contentHeight * 0.5
rowTitle:setTextColor( 0, 0, 0 )
Corona build 1076, I believe, uses the Widget 2.0 library, and those widgets require a different syntax for determining the row group. See this article on updating your widgets to the new syntax:
http://docs.coronalabs.com/api/library/widget/migration.html
Also, there is a newer public build, 1135, that contains considerable bug fixes to the widget library. I would suggest upgrading so that you have those fixes.