Background: I have a column with email subjects. I want these to be a maximum of 30 characters long. For the user to spot the ones I have cut at 30 characters, I want to add a "..." suffix.
Problem: If the column content is over 30 characters, I want to remove all characters over 30 and add "..." to the end of the string.
What I have tried: I have added the following steps in the Power Query Editor, but it adds "..." to all lines, including the ones under 30 characters.
#"Extracted First Characters" = Table.TransformColumns(#"Duplicated Column", {{"subject - Copy", each Text.Start(_, 30), type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Extracted First Characters",{{"subject - Copy", "subject - short"}}),
#"Added Suffix" = Table.TransformColumns(#"Renamed Columns1", {{"subject - short", each _ & "...", type text}}),
Thanks in advance
You can transform the subject column, in one step:
= Table.TransformColumns(#"Previous Step", {{"Subject", each if Text.Length(_) > 30 then Text.Start(_, 30) & "..." else _, type text}})
We test if the text length is greater than 30 characters and, if so, return the first 30 characters suffixed with "..."; otherwise we return the text as is.
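Note that this makes truncated values 33 characters long (30 plus the ellipsis). If the "..." must fit within the 30-character cap instead, a variant of the same step (a sketch, trimming to 27 first) would be:

= Table.TransformColumns(#"Previous Step", {{"Subject", each if Text.Length(_) > 30 then Text.Start(_, 27) & "..." else _, type text}})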
Related
I have a Google Sheet from which I want to grab the header with a date-time stamp, match it against another sheet, find the entries with that date, suburb, and type, and get the average cost.
My formula is =AVERAGEIFS(Sheet1!C:C, Sheet1!A:A, B11:B, Sheet1!F:F, C10), which gives me the average, but I've hard-coded the header date.
What I want to do is dynamically pull in the date-time from the header row above, instead of manually adding it to the formula, something like this:
=AVERAGEIFS(Sheet1!C:C,Sheet1!A:A, B11:B, Sheet1!F:F, =CHAR(COLUMN()+64) & 10)
This would automatically build the current column letter + row 10, e.g. C10, D10, E10.
If I put =CHAR(COLUMN()+64) & 10 in its own cell it works, but when I add it to the AVERAGEIFS condition it gives me a parsing error.
I'm expecting C10, D10, E10 from CHAR(COLUMN()+64) & 10, which should allow me to dynamically filter data on the date in the header above it.
Try:
=AVERAGEIFS(Sheet1!C:C, Sheet1!A:A, B11:B, Sheet1!F:F, INDIRECT(CHAR(COLUMN()+64)&10))
INDIRECT turns the text built by CHAR(COLUMN()+64)&10 (e.g. "C10") into an actual cell reference; a bare =CHAR(...) inside AVERAGEIFS is invalid syntax, which is why you got the parsing error.
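Note that CHAR(COLUMN()+64) only produces correct letters for columns A through Z. If the data may extend past column Z, ADDRESS builds the reference text for any column (same idea, a variant):

=AVERAGEIFS(Sheet1!C:C, Sheet1!A:A, B11:B, Sheet1!F:F, INDIRECT(ADDRESS(10, COLUMN())))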
I'm trying to create a month/year table in Power Query (M). It seems there is no easy way to do that. Here is the code I created to reach that goal.
It's unclear exactly what you are looking for, but this:
let
    Years = Table.ExpandListColumn(Table.FromRecords({[Years = {1980..2020}]}), "Years"),
    #"Add Months" = Table.ExpandListColumn(Table.AddColumn(Years, "Months", each {1..12}), "Months"),
    #"Add Month Names" = Table.AddColumn(#"Add Months", "MonthName", each Date.MonthName(#datetime([Years], [Months], 1, 0, 0, 0)))
in
    #"Add Month Names"
generates a table of Year / Months / MonthName rows, and you can change the start/end years in the code.
Here is my code. If anybody can find an easier and more elegant solution, it will be welcome.
To use this code, create a blank query, go to the Advanced Editor, and replace the existing code with this one.
let
first_date = #date(2020, 1, 1),
last_day = DateTime.LocalNow(),
num_months = ((Date.Year(last_day) - Date.Year(first_date)) * 12 + Date.Month(last_day) - Date.Month(first_date)),
list_of_num = List.Numbers(0, num_months, 1),
table_from_list = Table.FromValue(list_of_num, [DefaultColumnName = "Index"]),
add_col_year = Table.AddColumn(table_from_list, "Year", each Date.Year(Date.AddMonths(first_date, [Index])), Int64.Type),
add_col_month = Table.AddColumn(add_col_year, "Mes", each Date.Month(Date.AddMonths(first_date, [Index])), Int64.Type)
in
add_col_month
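Since the question invites alternatives: a somewhat shorter sketch of the same idea (column names are illustrative) avoids the index arithmetic by generating one date per month with List.Generate:

let
    first_date = #date(2020, 1, 1),
    last_date = Date.From(DateTime.LocalNow()),
    // one date per month, from first_date up to the current month
    month_starts = List.Generate(
        () => first_date,
        each _ <= last_date,
        each Date.AddMonths(_, 1)
    ),
    to_table = Table.FromList(month_starts, Splitter.SplitByNothing(), {"MonthStart"}),
    add_year = Table.AddColumn(to_table, "Year", each Date.Year([MonthStart]), Int64.Type),
    add_month = Table.AddColumn(add_year, "Month", each Date.Month([MonthStart]), Int64.Type)
in
    add_month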
There are 165 values here, and they are comma separated. This step is called CommaSeperated.
workItemList is a function which takes in the value from CommaSeperated and returns the table here.
I want to split the 165 items in CommaSeperated into batches of 100 and call workItemList for each batch.
Any ideas on how this can be done?
I have managed to take a comma-separated text string and turn it into a grouped table of 100s.
The first group of 100 is called "100"; the next is called "200" and starts at item 101.
Here are my steps:
First I split the text string into columns, using comma as the separator and splitting at each occurrence.
Then I transposed the whole thing into one column.
Added an index.
Added a modulo-100 column.
Added a custom column "100counter": = Table.AddColumn(#"Inserted Modulo", "100counter", each if [Modulus]=99 then [Indeks]+1 else null)
Ignore the renamed-columns step :D
Used Fill Up first, because the initial rows (indexes 0-98) have null as their 100counter.
Used Fill Down second, because the last records can still be null if the list doesn't end exactly on index 99.
Grouped by "100counter", all rows.
Abridged code (I cleaned the 1700 split columns out of the split-by-delimiter step):
let
Source = Excel.CurrentWorkbook(){[Name="Tabel2"]}[Content],
#"Split Column by Delimiter" = //Alot of spam code here which is essentially just alll the columns being split. Mark ALL, choose split columns by delimiter, choose comma and all.
#"Transposed Table" = Table.Transpose(#"Split Column by Delimiter"),
#"Added Index" = Table.AddIndexColumn(#"Transposed Table", "Indeks", 0, 1),
#"Inserted Modulo" = Table.AddColumn(#"Added Index", "Modulus", each Number.Mod([Indeks], 100), type number),
#"Added Custom" = Table.AddColumn(#"Inserted Modulo", "100counter", each if [Modulus]=99 then [Indeks]+1 else null),
#"Filled Up" = Table.FillUp(#"Added Custom",{"100counter"}),
#"Filled Down" = Table.FillDown(#"Filled Up",{"100counter"}),
#"Grouped Rows" = Table.Group(#"Filled Down", {"100counter"}, {{"Antal", each _, type table [Column1=number, Indeks=number, Modulus=number, 100counter=number]}})
in
#"Grouped Rows"
I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior"; is something wrong?
I wonder if the problem is that the predicate does not support BETWEEN or IN?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date. In that case the range is different, and the partitions being read by Glue/Spark are correct.
In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition.
A condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting.
Let me know if it helps.
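Applied to the DDL above, that reordering would look like this (only the partition clause shown):

PARTITIONED BY (
  `querydatetime` string,
  `querydestinationplace` string)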
Pushdown predicates on a Glue DynamicFrame work fine with BETWEEN as well as the IN clause, as long as the partition columns are defined in the correct sequence in the table definition and used in that sequence in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read into the DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try defining the dates as strings and building the predicate like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
I have the following code to build a PDF document with Prawn:
items = [["PERIOD","EMPLOYEE", "EMPLOYEE NAME", "HOURS", "FTES"]]
items += #mandates.each.map do |mandate|
[
mandate[:fte_period_end_date],
mandate[:fte_employee_id],
strname,
mandate[:fte_sum_of_hours],
mandate[:fte_sum_of_ftes],
]
end
@mandates is sorted by fte_employee_id and then by fte_period_end_date.
I want to insert a totals line per employee for fte_sum_of_hours and fte_sum_of_ftes when passing to the next employee.
Which command lets me insert these lines with Prawn?
Pass them in the array you are displaying the totals from: in Ruby, calculate the total for each section of elements before handing the data over. Don't do the work in Prawn (it's not Excel).
data = [["product 1: ", "$10.00"], ["product 2: ", "$20.00"], ["Subtotal:", "$30.00"]]
Then you can format the table with consideration to row 3, the subtotal, using cell styles.
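Applied to the question's data, a minimal sketch (assuming each mandate is a hash with the keys shown above, and using a hypothetical :fte_employee_name key in place of strname) could look like this:

items = [["PERIOD", "EMPLOYEE", "EMPLOYEE NAME", "HOURS", "FTES"]]
# group_by preserves the sorted order, so rows stay grouped per employee
@mandates.group_by { |m| m[:fte_employee_id] }.each do |employee_id, rows|
  rows.each do |mandate|
    items << [
      mandate[:fte_period_end_date],
      mandate[:fte_employee_id],
      mandate[:fte_employee_name], # hypothetical, stands in for strname
      mandate[:fte_sum_of_hours],
      mandate[:fte_sum_of_ftes]
    ]
  end
  # totals row for this employee
  items << ["", employee_id, "TOTAL",
            rows.sum { |m| m[:fte_sum_of_hours] },
            rows.sum { |m| m[:fte_sum_of_ftes] }]
end

items can then be rendered with Prawn's table(items), and the totals rows picked out for styling by their row index.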