show_chunks() does not correspond to what DELETE complains about - compression

This is really puzzling. I need to delete one day of data from a hypertable in TimescaleDB 1.7:
DELETE FROM raw WHERE tm::date = '2020-11-06' -- the local date style is YYYY-MM-DD
Before doing that, I check which chunks I need to decompress, giving it a one-day margin, and get two chunks:
SELECT show_chunks('raw', newer_than => '2020-11-05 00:00'::timestamp)
---
Result:
"_timescaledb_internal._hyper_1_19_chunk"
"_timescaledb_internal._hyper_1_21_chunk"
So I decompress these two. However, when I run the DELETE command above, I still get an error about a totally different chunk:
ERROR: cannot update/delete rows from chunk "_hyper_1_1_chunk" as it
is compressed SQL state: XX000
BTW, this chunk is empty as far as I can see by looking at it in pgAdmin. Any idea what's going on? It looks like a bug to me, but maybe I'm doing something wrong?
Thanks!
Edit:
Below is an excerpt from the result of EXPLAIN DELETE, as requested by @k_rus:
EXPLAIN DELETE FROM raw WHERE tm::date = '2020-11-06'
Result:
"Delete on raw (cost=0.00..719.63 rows=147 width=6)"
" Delete on raw"
" Delete on _hyper_1_1_chunk"
" Delete on _hyper_1_2_chunk"
...
" Delete on _hyper_1_22_chunk"
" -> Seq Scan on raw (cost=0.00..0.00 rows=1 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
" -> Custom Scan (CompressChunkDml) on _hyper_1_1_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_1_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
" -> Custom Scan (CompressChunkDml) on _hyper_1_2_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_2_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
...
" -> Custom Scan (CompressChunkDml) on _hyper_1_22_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_22_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"

Thank you for providing the EXPLAIN. It shows that the DELETE statement is planned to touch every chunk of the hypertable, and only at runtime does the execution of the DELETE discover that there is nothing to delete in most chunks.
Since some chunks are compressed, TimescaleDB returns an error for the deletes planned on those compressed chunks.
The only way to avoid the error is to write the selection condition so that it triggers chunk exclusion at planning time. In the question the condition is tm::date = '2020-11-06', which first extracts the date from the column tm and then compares it with the constant. The planner therefore cannot decide whether a chunk can be excluded and instead pushes the filter down for runtime execution on every chunk.
To resolve this, use a selection condition that compares the time dimension column directly with a constant, or with a value that can be calculated at planning time. Assuming tm is the time dimension column of the hypertable raw, I suggest converting the constant date into a timestamp, e.g. '2020-11-06'::timestamp, and leaving the column untouched. You will need to specify a range of timestamps that covers all rows belonging to the targeted date.
For example, the DELETE statement can be:
DELETE FROM raw WHERE tm >= '2020-11-06 00:00' AND tm < '2020-11-07 00:00'
(A half-open range avoids missing rows in the last minute of the day, which a condition like BETWEEN '2020-11-06 00:00' AND '2020-11-06 23:59' would skip.)
Answers to questions:
show_chunks() does not correspond to what DELETE complains about
The show_chunks call and the DELETE statement have different conditions and thus cannot be compared directly. show_chunks only lists chunks that cover times newer than the given constant, while the DELETE is planned to check every chunk, so it can complain about any chunk of the hypertable.
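For reference, a sketch of bounding show_chunks on both sides, so the listing is closer to the set of chunks the targeted date can actually touch (the one-day margins mirror the margin used in the question; see the show_chunks documentation for the exact older_than/newer_than inclusion rules):
SELECT show_chunks('raw',
                   older_than => '2020-11-08 00:00'::timestamp,  -- upper bound, one-day margin
                   newer_than => '2020-11-05 00:00'::timestamp); -- lower bound, one-day margin
Chunk boundaries rarely align with calendar days, so a listed chunk may still cover more than the targeted date.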
BTW this chunk is empty as far as I can see by looking at it in the pgAdmin. Any idea what's going on? Looks like a bug to me, but maybe I'm doing something wrong?
A compressed chunk stores its data in a different internal chunk, so no data is visible in _hyper_1_1_chunk itself. TimescaleDB assumes that data are read through the hypertable, not directly from the chunks; the hypertable is an abstraction that hides TimescaleDB's implementation details.
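If you want to confirm the rows are still there, query through the hypertable rather than the chunk. A minimal sketch (the time range is hypothetical; use whatever range _hyper_1_1_chunk covers):
SELECT count(*)
FROM raw
WHERE tm >= '2020-01-01' AND tm < '2020-01-08';  -- hypothetical range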

Which specific version of 1.7? There was a bug around this, but it should be fixed from 1.7.3 forward: https://github.com/timescale/timescaledb/pull/2092
If you’re on 1.7.3 or later and still seeing this, it’d be best to open an issue on the timescaledb GitHub repo.
You can check your version by connecting with psql and running \dx
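If you are not in psql, the installed version can also be read from the standard PostgreSQL catalog:
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';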

Related

Postgres db index not being used on Heroku

I'm trying to debug a slow query for a model that looks like:
class Employee(TimeStampMixin):
    title = models.TextField(blank=True, db_index=True)
    seniority = models.CharField(blank=True, max_length=128, db_index=True)
The query is: Employee.objects.exclude(seniority='').filter(title__icontains=title).order_by('seniority').values_list('seniority')
When I run it locally it takes ~0.3 seconds (same database size). An explain locally shows:
Limit (cost=1000.58..196218.23 rows=7 width=1) (actual time=299.016..300.366 rows=1 loops=1)
Output: seniority
Buffers: shared hit=2447163 read=23669
-> Gather Merge (cost=1000.58..196218.23 rows=7 width=1) (actual time=299.015..300.364 rows=1 loops=1)
Output: seniority
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2447163 read=23669
-> Parallel Index Only Scan using companies_e_seniori_12ac68_idx on public.companies_employee (cost=0.56..195217.40 rows=3 width=1) (actual time=293.195..293.200 rows=0 loops=3)
Output: seniority
Filter: (((companies_employee.seniority)::text <> ''::text) AND (upper(companies_employee.title) ~~ '%INFORMATION SPECIALIST%'::text))
Rows Removed by Filter: 2697599
Heap Fetches: 2819
Buffers: shared hit=2447163 read=23669
Worker 0: actual time=291.087..291.088 rows=0 loops=1
Buffers: shared hit=820222 read=7926
Worker 1: actual time=291.056..291.056 rows=0 loops=1
Buffers: shared hit=812538 read=7888
Planning Time: 0.209 ms
Execution Time: 300.400 ms
However, when I run the same code on Heroku I get execution times of 3s+, possibly because the local plan uses an index while the Heroku plan does not:
Limit (cost=216982.74..216983.39 rows=6 width=1) (actual time=988.738..1018.964 rows=1 loops=1)
Output: seniority
Buffers: shared hit=199527 dirtied=5
-> Gather Merge (cost=216982.74..216983.39 rows=6 width=1) (actual time=980.932..1011.157 rows=1 loops=1)
Output: seniority
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=199527 dirtied=5
-> Sort (cost=215982.74..215982.74 rows=3 width=1) (actual time=959.233..959.234 rows=0 loops=3)
Output: seniority
Sort Key: companies_employee.seniority
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=199527 dirtied=5
Worker 0: actual time=957.414..957.414 rows=0 loops=1
Sort Method: quicksort Memory: 25kB
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.179 ms, Inlining 0.000 ms, Optimization 0.879 ms, Emission 9.714 ms, Total 11.771 ms
Buffers: shared hit=54855 dirtied=2
Worker 1: actual time=939.591..939.592 rows=0 loops=1
Sort Method: quicksort Memory: 25kB
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.741 ms, Inlining 0.000 ms, Optimization 0.654 ms, Emission 6.531 ms, Total 7.926 ms
Buffers: shared hit=87867 dirtied=1
-> Parallel Seq Scan on public.companies_employee (cost=0.00..215982.73 rows=3 width=1) (actual time=705.244..959.146 rows=0 loops=3)
Output: seniority
Filter: (((companies_employee.seniority)::text <> ''::text) AND (upper(companies_employee.title) ~~ '%INFORMATION SPECIALIST%'::text))
Rows Removed by Filter: 2939330
Buffers: shared hit=199449 dirtied=5
Worker 0: actual time=957.262..957.262 rows=0 loops=1
Buffers: shared hit=54816 dirtied=2
Worker 1: actual time=939.491..939.491 rows=0 loops=1
Buffers: shared hit=87828 dirtied=1
Query Identifier: 2827140323627869732
Planning:
Buffers: shared hit=293 read=1 dirtied=1
I/O Timings: read=0.021
Planning Time: 1.078 ms
JIT:
Functions: 13
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.746 ms, Inlining 0.000 ms, Optimization 2.224 ms, Emission 23.189 ms, Total 28.160 ms
Execution Time: 1050.493 ms
I confirmed that the model indexes are identical in my local database and on Heroku; this is what they are:
indexname | indexdef
----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------
companies_employee_pkey | CREATE UNIQUE INDEX companies_employee_pkey ON public.companies_employee USING btree (id)
companies_employee_company_id_c24081a8 | CREATE INDEX companies_employee_company_id_c24081a8 ON public.companies_employee USING btree (company_id)
companies_employee_person_id_936e5c6a | CREATE INDEX companies_employee_person_id_936e5c6a ON public.companies_employee USING btree (person_id)
companies_employee_role_8772f722 | CREATE INDEX companies_employee_role_8772f722 ON public.companies_employee USING btree (role)
companies_employee_role_8772f722_like | CREATE INDEX companies_employee_role_8772f722_like ON public.companies_employee USING btree (role text_pattern_ops)
companies_employee_seniority_b10393ff | CREATE INDEX companies_employee_seniority_b10393ff ON public.companies_employee USING btree (seniority)
companies_employee_seniority_b10393ff_like | CREATE INDEX companies_employee_seniority_b10393ff_like ON public.companies_employee USING btree (seniority varchar_pattern_ops)
companies_employee_title_78009330 | CREATE INDEX companies_employee_title_78009330 ON public.companies_employee USING btree (title)
companies_employee_title_78009330_like | CREATE INDEX companies_employee_title_78009330_like ON public.companies_employee USING btree (title text_pattern_ops)
companies_employee_institution_75d6c7e9 | CREATE INDEX companies_employee_institution_75d6c7e9 ON public.companies_employee USING btree (institution)
companies_employee_institution_75d6c7e9_like | CREATE INDEX companies_employee_institution_75d6c7e9_like ON public.companies_employee USING btree (institution text_pattern_ops)
companies_e_seniori_12ac68_idx | CREATE INDEX companies_e_seniori_12ac68_idx ON public.companies_employee USING btree (seniority, title)
title_seniority | CREATE INDEX title_seniority ON public.companies_employee USING btree (upper(title), seniority)

How Scan Values with colon ":" from a Dynamodb Table?

I am facing a similar problem: using boto3, the query does not work, while it works in the console.
First I tried this scan without success:
text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression = filter_expr, Limit = 1000)
Then I noticed that for a text value that does not contain ":", the scan works.
So, I tried this second scan using ExpressionAttributeNames and ExpressionAttributeValues
table.scan(
    FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
    ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
    ExpressionAttributeValues = {
        ':v0': '2020-04-01',
        ':v1': '2020-04-27',
        ':v2': {"S": text}},
    Limit = 1000
)
Failed again.
In the end, if I change the first example to:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":".
Is there another way to search for text containing the ":" character?
[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check that you have an up-to-date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text, because you're using a resource-level scan and you don't need to supply the low-level attribute type (it's only required for a client-level scan).
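For reference, a minimal sketch of the working resource-level scan (the table name is hypothetical); the ":" in the value needs no escaping:
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource('dynamodb').Table('my-table')  # hypothetical table name

text = 'city:barcelona'
response = table.scan(
    FilterExpression=Attr('timestamp').between('2020-04-01', '2020-04-27')
                     & Attr('text').eq(text),
    Limit=1000
)
items = response['Items']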

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
    Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior"; is something wrong?
I wonder if the problem is that it does not support BETWEEN or IN inside the predicate?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One issue I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. I don't see the today variable declared anywhere in your code, so I am assuming it has been initialized with today's date.
In that case the range is different from what you intended, and the partitions being read by Glue/Spark are correct.
In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition. A condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting.
Let me know if it helps.
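A sketch of the question's call with that single substitution (variable names follow the question; arr is assumed to hold the destination place ids):
# Use `end` (not `today`) so the predicate matches the intended
# 2019-02-13 to 2019-02-27 window.
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database = "xxx", table_name = "flights", transformation_ctx = "flights",
    push_down_predicate = """
        querydatetime BETWEEN '%s' AND '%s'
        AND querydestinationplace IN (%s)
    """ % (start.strftime("%Y-%m-%d"),
           end.strftime("%Y-%m-%d"),
           ",".join(str(s) for s in arr)))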
Pushdown predicates in a Glue DynamicFrame work fine with BETWEEN as well as the IN clause, as long as the partition columns are specified in the correct sequence in both the table definition and the query.
I have a table with three levels of partitions.
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
    database = "abc", table_name = "pqr", transformation_ctx = "flights",
    push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you are going to get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database = "xxx", table_name = "flights", transformation_ctx = "flights",
    push_down_predicate = """querydestinationplace IN ('%s') AND
        querydatetime BETWEEN '%s' AND '%s'
        """ % (",".join(map(lambda s: str(s), arr)),
               start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try defining the end like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Splitting an array into columns in Athena/Presto

I feel this should be simple, but I've struggled to find the right terminology, please bear with me.
I have two columns, timestamp and voltages, where voltages is an array.
If I do a simple
SELECT timestamp, voltages FROM table
Then I'd get a result of:
|timestamp | voltages |
|1544435470 |3.7352,3.749,3.7433,3.7533|
|1544435477 |3.7352,3.751,3.7452,3.7533|
|1544435484 |3.7371,3.749,3.7433,3.7533|
|1544435490 |3.7352,3.749,3.7452,3.7533|
|1544435497 |3.7352,3.751,3.7452,3.7533|
|1544435504 |3.7352,3.749,3.7452,3.7533|
But I want to split the voltages array so each element in its array is its own column.
|timestamp | v1 | v2 | v3 | v4 |
|1544435470 |3.7352 |3.749 |3.7433 |3.7533|
|1544435477 |3.7352 |3.751 |3.7452 |3.7533|
|1544435484 |3.7371 |3.749 |3.7433 |3.7533|
|1544435490 |3.7352 |3.749 |3.7452 |3.7533|
|1544435497 |3.7352 |3.751 |3.7452 |3.7533|
|1544435504 |3.7352 |3.749 |3.7452 |3.7533|
I know I can do this with:
SELECT timestamp, voltages[1] as v1, voltages[2] as v2 FROM table
But I'd need to be able to do this programmatically, as opposed to listing them out.
Am I missing something obvious?
This should serve your purpose if you have arrays of fixed length.
You first need to break each array element out into its own row. You can do this using the UNNEST operator in the following way:
SELECT timestamp, volt
FROM table
CROSS JOIN UNNEST(voltages) AS t(volt)
Using the resulting table you can pivot (convert multiple rows with the same timestamp into multiple columns) by following Gordon Linoff's answer to "need to convert data in multiple rows with same ID into 1 row with multiple columns".
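A sketch of what that pivot can look like in Athena/Presto SQL, assuming a fixed four-element array and a hypothetical table name readings (table and timestamp are reserved words, hence the rename and the quoting):
SELECT "timestamp",
       max(CASE WHEN idx = 1 THEN volt END) AS v1,
       max(CASE WHEN idx = 2 THEN volt END) AS v2,
       max(CASE WHEN idx = 3 THEN volt END) AS v3,
       max(CASE WHEN idx = 4 THEN volt END) AS v4
FROM readings
CROSS JOIN UNNEST(voltages) WITH ORDINALITY AS t(volt, idx)
GROUP BY "timestamp"
The output column list still has to be written out by hand, since SQL needs a fixed set of output columns, but for a fixed-length array that is a one-time definition.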

Apache Beam: How To Simultaneously Create Many PCollections That Undergo Same PTransform?

Thanks in advance!
[+] Issue:
I have a lot of files on Google Cloud Storage, and for every file I have to:
get the file
make a bunch of Google Cloud Storage API calls on each file to index it (e.g. name = blob.name, size = blob.size)
unzip it
search for stuff in there
put the indexing information plus whatever was found inside the file into a BigQuery table
I've been using Python 2.7 and the Google Cloud SDK. This takes hours if I run it sequentially. Apache Beam/Dataflow was suggested to me for processing in parallel.
[+] What I've been able to do:
I can read from one file, perform a PTransform and write to another file.
def loadMyFile(pipeline, path):
    return pipeline | "LOAD" >> beam.io.ReadFromText(path)

def myFilter(request):
    return request

with beam.Pipeline(options=PipelineOptions()) as p:
    data = loadMyFile(p, path)
    output = data | "FILTER" >> beam.Filter(myFilter)
    output | "WRITE" >> beam.io.WriteToText(google_cloud_options.staging_location)
[+] What I want to do:
How can I load many of those files simultaneously, perform the same transform on them in parallel, and then write to BigQuery in parallel?
Diagram Of What I Wish to Perform
[+] What I've Read:
https://beam.apache.org/documentation/programming-guide/
http://enakai00.hatenablog.com/entry/2016/12/09/104913
Again, many thanks
textio accepts a file_pattern.
From the Python SDK docs:
file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters.
For example, if you have a bunch of *.txt files in gs://my-bucket/files/, you can say:
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "LOAD" >> beam.io.textio.ReadFromText(file_pattern="gs://my-bucket/files/*.txt")
     | "FILTER" >> beam.Filter(myFilter)
     | "WRITE" >> beam.io.textio.WriteToText(output_location))
If you somehow do have multiple PCollections of the same type, you can also Flatten them into a single one
merged = (
    (pcoll1, pcoll2, pcoll3)
    # A list of tuples can be "piped" directly into a Flatten transform.
    | beam.Flatten())
Ok so I resolved this by doing the following:
1) get the name of a bucket from somewhere | first PCollection
2) get a list of blobs from that bucket | second PCollection
3) do a FlatMap to get blobs individually from the list | third PCollection
4) do a ParDo that gets the metadata
5) write to BigQuery
my pipeline looks like this:
with beam.Pipeline(options=options) as pipe:
    bucket = pipe | "GetBucketName" >> beam.io.ReadFromText('gs://example_bucket_eraseme/bucketName.txt')
    listOfBlobs = bucket | "GetListOfBlobs" >> beam.ParDo(ExtractBlobs())
    blob = listOfBlobs | "SplitBlobsIndividually" >> beam.FlatMap(lambda x: x)
    dic = blob | "GetMetaData" >> beam.ParDo(ExtractMetadata())
    dic | "WriteToBigQuery" >> beam.io.WriteToBigQuery(