How to make Spark session read all the files recursively? - regex

Displaying the directories under which JSON files are stored:
$ tree -d try/
try/
├── 10thOct_logs1
├── 11thOct
│   └── logs2
└── Oct
    └── 12th
        └── logs3
The task is to read all the logs using SparkSession.
Is there an elegant way to read through all the files in directories and then sub-directories recursively?
A few commands that I tried are prone to unintentional exclusions.
spark.read.json("file:///var/foo/try/<exp>")
+----------+---+-----+-------+
| <exp> -> | * | */* | */*/* |
+----------+---+-----+-------+
| logs1    | y | y   | n     |
| logs2    | n | y   | y     |
| logs3    | n | n   | y     |
+----------+---+-----+-------+
You can see in the above table that none of the three expressions matches all the directories (located at 3 different depths) at the same time. Frankly speaking, I wasn't expecting the exclusion of 10thOct_logs1 while using the third expression */*/*.
This leads me to conclude that a file or directory path is only picked up if it matches the whole expression exactly, segment by segment; there is no recursive descent beyond the segment after the last /, and everything else is ignored.

Update
A new option, recursiveFileLookup, was introduced in Spark 3 to read from nested folders:
spark.read.option("recursiveFileLookup", "true").json("file:///var/foo/try")
For older versions, you can alternatively use Hadoop's listFiles to recursively list all the file paths and then pass them to the Spark reader:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name

val conf = sc.hadoopConfiguration
// get all file paths recursively
val fromFolder = new Path("file:///var/foo/try/")
val logfiles = fromFolder.getFileSystem(conf).listFiles(fromFolder, true)
var files = Seq[String]()
while (logfiles.hasNext) {
  // one could filter for specific files here
  files = files :+ logfiles.next().getPath().toString
}
// read multiple paths
val df = spark.read.csv(files: _*)
df.select(input_file_name()).distinct().show(false)
+-------------------------------------+
|input_file_name() |
+-------------------------------------+
|file:///var/foo/try/11thOct/log2.csv |
|file:///var/foo/try/10thOct_logs1.csv|
|file:///var/foo/try/Oct/12th/log3.csv|
+-------------------------------------+
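A rough PySpark analogue of the same idea, for local paths only (os.walk does not see HDFS or S3; this is just a sketch of collecting the paths yourself and handing them to the reader):
import os
from pyspark.sql.functions import input_file_name

# Walk the local directory tree and collect every file path.
base = "/var/foo/try"
files = ["file://" + os.path.join(root, name)
         for root, _, names in os.walk(base)
         for name in names]

# The DataFrame reader accepts a list of paths.
df = spark.read.csv(files)
df.select(input_file_name()).distinct().show(truncate=False)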

Unfortunately, Hadoop globs do not support recursive matching. See Querying the Filesystem#File Patterns.
There is, however, the option of listing a glob for each directory level, using alternation:
{a,b}    alternation: matches either expression a or b
You have to be careful not to match the same file twice, otherwise it will appear as a duplicate.
spark.read.json("./try/{*logs*,*/*logs*,*/*/*logs*}")
You can also load multiple DataFrames and union them:
val dfs = List(
  spark.read.json("./try/*logs*"),
  spark.read.json("./try/*/*logs*"),
  spark.read.json("./try/*/*/*logs*")
)
val df = dfs.reduce(_ union _)
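For reference, the same union approach in PySpark would look roughly like this (a sketch; reduce comes from functools):
from functools import reduce
from pyspark.sql import DataFrame

dfs = [
    spark.read.json("./try/*logs*"),
    spark.read.json("./try/*/*logs*"),
    spark.read.json("./try/*/*/*logs*"),
]
df = reduce(DataFrame.union, dfs)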

Related

AWS Glue Crawlers - How to handle large directory structure of CSVs that may only contain strings

Been at this for a few days and any help is greatly appreciated.
Background:
I am attempting to create 1+ glue crawlers to crawl the following S3 "directory" structure:
.
+-- _source1
| +-- _item1
| | +-- _2019 #year
| | | +-- _08 #month
| | | | +-- _30 #day
| | | | | +-- FILE1.csv #files
| | | | | +-- FILE2.csv
| | | | +-- _31
| | | | | +-- FILE1.csv
| | | | | +-- FILE2.csv
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
| +-- _item2
| | +-- _2019
| | | +-- _08
| | | | +-- _30
| | | | +-- _31
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
+-- _source2
| +-- ....
........ # and so on...
This goes on for several sources, each with potentially 30+ items, each of which has the year/month/day directory structure within.
All files are CSVs, and files should not change once they're in S3. However, the schemas for the files within each item folder may have columns added in the future.
2019/12/01/FILE.csv may have additional columns compared to 2019/09/01/FILE.csv.
What I've Done:
In my testing so far, crawlers created at source level directories (see above) have worked perfectly as long as no CSV only contains string-type columns.
This is due to the following restriction, as stated in the AWS docs:
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
Normally, I'd imagine you could get around this by creating a custom classifier that expects a certain CSV schema, but seeing as I may have 200+ items (different schemas) to crawl, I'd like to avoid this.
Proposed Solutions:
Ideally, I'd like to force my crawlers to interpret the first row of
every CSV as a header, but this doesn't seem possible...
Add a dummy INT column to every CSV to force my crawlers to read the CSV headers, and delete/ignore the column down the pipeline. (Seems very hackish)
Find another file format that works (will require changes throughout my ETL pipeline)
DON'T USE GLUE
Thanks again for any help!
Found the issue: it turns out that for an updated Glue crawler classifier to take effect, a new crawler must be created and the updated classifier applied to it. As far as I can tell this is not explicitly mentioned in the AWS docs; I've only seen it mentioned over on GitHub.
Early on in my testing I modified an existing CSV classifier that specifies "Has Columns", but never created a new crawler to apply my modified classifier to. Once I created a new crawler and applied the classifier, all data catalog tables were created as expected, regardless of column types.
TL;DR: Modified classifiers will not take effect unless they are applied to a new crawler. Source
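For reference, a minimal boto3 sketch of that fix: create a CSV classifier that declares the header row as present, then create a new crawler with that classifier attached. The classifier name, crawler name, role ARN, database, and S3 path below are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

# CSV classifier that forces the first row to be treated as the header,
# regardless of column types.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-has-headers",      # hypothetical name
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

# The classifier only takes effect on a crawler created after it exists.
glue.create_crawler(
    Name="source1-crawler",                                 # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    DatabaseName="my_catalog_db",                           # hypothetical
    Classifiers=["csv-has-headers"],
    Targets={"S3Targets": [{"Path": "s3://my-bucket/source1/"}]},  # hypothetical
)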

Apache Beam: How To Simultaneously Create Many PCollections That Undergo Same PTransform?

Thanks in advance!
[+] Issue:
I have a lot of files on Google Cloud, and for every file I have to:
get the file
Make a bunch of Google Cloud Storage API calls on each file to index it (e.g. name = blob.name, size = blob.size)
unzip it
search for stuff in there
put the indexing information + stuff found inside file in a BigQuery Table
I've been using Python 2.7 and the Google Cloud SDK. This takes hours if I run it linearly, so Apache Beam/Dataflow was suggested to me for processing in parallel.
[+] What I've been able to do:
I can read from one file, perform a PTransform and write to another file.
def loadMyFile(pipeline, path):
    return pipeline | "LOAD" >> beam.io.ReadFromText(path)

def myFilter(request):
    return request

with beam.Pipeline(options=PipelineOptions()) as p:
    data = loadMyFile(p, path)
    output = data | "FILTER" >> beam.Filter(myFilter)
    output | "WRITE" >> beam.io.WriteToText(google_cloud_options.staging_location)
[+] What I want to do:
How can I load many of those files simultaneously, perform the same transform to them in parallel, then in parallel write to big query?
Diagram Of What I Wish to Perform
[+] What I've Read:
https://beam.apache.org/documentation/programming-guide/
http://enakai00.hatenablog.com/entry/2016/12/09/104913
Again, many thanks
textio accepts a file_pattern.
From the Python SDK:
file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters
For example, suppose you have a bunch of *.txt files in gs://my-bucket/files/; you can write:
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "LOAD" >> beam.io.textio.ReadFromText(file_pattern="gs://my-bucket/files/*.txt")
     | "FILTER" >> beam.Filter(myFilter)
     | "WRITE" >> beam.io.textio.WriteToText(output_location))
If you somehow do end up with multiple PCollections of the same type, you can also Flatten them into a single one:
merged = (
    (pcoll1, pcoll2, pcoll3)
    # A list of tuples can be "piped" directly into a Flatten transform.
    | beam.Flatten())
Ok so I resolved this by doing the following:
1) get the name of a bucket from somewhere | first PCollection
2) get a list of blobs from that bucket | second PCollection
3) do a FlatMap to get blobs individually from the list | third PCollection
4) do a ParDo that gets the metadata
5) write to BigQuery
my pipeline looks like this:
with beam.Pipeline(options=options) as pipe:
    bucket = pipe | "GetBucketName" >> beam.io.ReadFromText('gs://example_bucket_eraseme/bucketName.txt')
    listOfBlobs = bucket | "GetListOfBlobs" >> beam.ParDo(ExtractBlobs())
    blob = listOfBlobs | "SplitBlobsIndividually" >> beam.FlatMap(lambda x: x)
    dic = blob | "GetMetaData" >> beam.ParDo(ExtractMetadata())
    dic | "WriteToBigQuery" >> beam.io.WriteToBigQuery(

Reading S3 files in nested directory through Spark EMR

I figured out how to read files into my pyspark shell (and script) from an S3 directory, e.g. by using:
rdd = sc.wholeTextFiles('s3n://bucketname/dir/*')
But, while that's great in letting me read all the files in ONE directory, I want to read every single file from all of the directories.
I don't want to flatten them or load everything at once, because I will have memory issues.
Instead, I need it to automatically go load all the files from each sub-directory in a batched manner. Is that possible?
Here's my directory structure:
S3_bucket_name -> year (2016 or 2017) -> month (max 12 folders) -> day (max 31 folders) -> sub-day folders (max 30; basically just partitioning the data collected each day).
Something like this, except it'll go for all 12 months and up to 31 days...
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(12)
|
|---Year(2017)
| |
| |---Month(1)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(2)
Each arrow above represents a fork. e.g. I've been collecting data for 2 years, so there are 2 years in the "year" fork. Then for each year, up to 12 months max, and then for each month, up to 31 possible day folders. And in each day, there will be up to 30 folders just because I split it up that way...
I hope that makes sense...
I was looking at another post (read files recursively from sub directories with spark from s3 or local filesystem) where I believe they suggested using wildcards, so something like:
rdd = sc.wholeTextFiles('s3n://bucketname/*/data/*/*')
But the problem with that is it tries to find a common folder among the various subdirectories - in this case there are no guarantees and I would just need everything.
However, along that line of reasoning, I thought: what if I did:
rdd = sc.wholeTextFiles('s3n://bucketname/*/*/*/*/*')
But the issue is that now I get OutOfMemory errors, probably because it's loading everything at once and freaking out.
Ideally, what I would be able to do is this:
Go to the sub-directory level of the day and read those in, so e.g.
First read in 2016/12/01, then 2016/12/02, up until 2016/12/31, and then 2017/01/01, then 2017/01/02, ... 2017/01/31 and so on.
That way, instead of using five wildcards (*) as I did above, it would somehow know to look through each sub-directory at the level of "day".
I thought of using a python dictionary to specify the file path to each of the days, but that seems like a rather cumbersome approach. What I mean by that is as follows:
file_dict = {
    0: '2016/12/01/*/*',
    1: '2016/12/02/*/*',
    ...
    30: '2016/12/31/*/*',
}
basically for all the folders, and then iterating through them and loading them in using something like this:
sc.wholeTextFiles('s3n://bucketname/' + file_dict[i])
But I don't want to manually type out all those paths. I hope this made sense...
EDIT:
Another way of asking the question is, how do I read the files from a nested sub-directory structure in a batched way? How can I enumerate all the possible folder names in my s3 bucket in python? Maybe that would help...
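For the enumeration part, one hedged sketch is to walk the common prefixes with boto3 and then read one day-level batch at a time (the bucket name is a placeholder, and the three levels of nesting assume the year/month/day layout above):
import boto3

s3 = boto3.client("s3")

def list_prefixes(bucket, prefix=""):
    """Return the immediate 'sub-folder' prefixes under the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    prefixes = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            prefixes.append(cp["Prefix"])
    return prefixes

bucket = "bucketname"  # placeholder
day_prefixes = [day
                for year in list_prefixes(bucket)
                for month in list_prefixes(bucket, year)
                for day in list_prefixes(bucket, month)]

# Read one day at a time instead of everything at once.
for day in day_prefixes:
    rdd = sc.wholeTextFiles("s3n://%s/%s*/*" % (bucket, day))
    # ... process this day's batch ...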
EDIT2:
The structure of the data in each of my files is as follows:
{json object 1},
{json object 2},
{json object 3},
...
{json object n},
For it to be "true json", it either just needed to be like the above without a trailing comma at the end, or something like this (note square brackets, and lack of the final trailing comma:
[
{json object 1},
{json object 2},
{json object 3},
...
{json object n}
]
The reason I did it entirely in PySpark, as a script I submit, is that I forced myself to handle this formatting quirk manually. If I use Hive/Athena, I am not sure how to deal with it.
Why don't you use Hive, or even better, Athena? Both will lay tables on top of the file system, giving you access to all the data, and you can then pull that into Spark.
Alternatively, I believe you can also use HiveQL in Spark to set up a tempTable on top of your file system location, and it will register it all as a Hive table which you can execute SQL against. It's been a while since I've done that, but it is definitely doable.
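A hedged sketch of that idea with the DataFrame API rather than HiveQL (it assumes the JSON formatting quirk above has already been cleaned up so each line is a valid JSON object): load the files, register a temporary view, and run SQL against it.
# Read all nested paths; unlike wholeTextFiles, the JSON reader does not
# pull each whole file in as a single record.
df = spark.read.json("s3n://bucketname/*/*/*/*/*")

# Register it as a table and query it with SQL.
df.createOrReplaceTempView("logs")
spark.sql("SELECT COUNT(*) FROM logs").show()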

Equivalent of SQL LIKE operator in R

In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The dataframe is always two columns with however many rows as there are files found with that specific extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaic way that I express this question. I know that in SQL, you can apply where clauses with like instead of =. So I could say `where Filename like '%1%'` and it would pull out all files with a 1 in the name. Is there a way to use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think to tell the script which one to pull from is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext
You can use grepl(), as in this answer:
subset(a, grepl("1", a$filename))
Or, if you're coming from an SQL background, you might want to look into sqldf.
You can use like from data.table to get SQL LIKE behaviour here.
From the documentation, see this example:
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
For your problem, suppose you have a data.frame df like this:
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext

How to take the output of Sys.command as string in OCaml?

In OCaml, I have this piece of code:
let s = Sys.command ("minisat test.txt | grep 'SATIS' ");;
I want to capture the output of minisat test.txt | grep "SATIS", which is SATISFIABLE/UNSATISFIABLE, in the string s.
I am getting the following output:
SATISFIABLE
val s : int = 0
So, how can I get the output of this command into a string?
Also, is it possible to get the time as well?
This is the output I get when I run minisat test.txt in the terminal:
WARNING: for repeatability, setting FPU to use double precision
============================[ Problem Statistics ]=============================
| |
| Number of variables: 5 |
| Number of clauses: 3 |
| Parse time: 0.00 s |
| Eliminated clauses: 0.00 Mb |
| Simplification time: 0.00 s |
| |
============================[ Search Statistics ]==============================
| Conflicts | ORIGINAL | LEARNT | Progress |
| | Vars Clauses Literals | Limit Clauses Lit/Cl | |
===============================================================================
===============================================================================
restarts : 1
conflicts : 0 (-nan /sec)
decisions : 1 (0.00 % random) (inf /sec)
propagations : 0 (-nan /sec)
conflict literals : 0 (-nan % deleted)
Memory used : 8.00 MB
CPU time : 0 s
SATISFIABLE
If you use just Sys, you can't.
However, you can create a temporary file (see the Filename module's documentation here) and tell the command to write its output to it:
let string_of_command () =
  let tmp_file = Filename.temp_file "" ".txt" in
  let _ = Sys.command ("minisat test.txt | grep 'SATIS' > " ^ tmp_file) in
  let chan = open_in tmp_file in
  let s = input_line chan in
  close_in chan;
  s
Note that this function is rough: you still have to handle potential errors properly. Anyway, you can adapt it to your needs.
You can avoid the temporary file trick by using the Unix library or more advanced libraries.
You have to use Unix.open_process_in or Unix.create_process, if you want to capture the output.
Or better use a higher level wrapper like 'shell' (from ocamlnet):
http://projects.camlcity.org/projects/dl/ocamlnet-4.0.2/doc/html-main/Shell_intro.html
But I wouldn't pipe it to grep (not portable). Parse the output with your favorite regex library inside OCaml.