How to pass arguments to streaming job on Amazon EMR - amazon-web-services

I want to produce the output of my map function, filtering the data by dates.
In local tests, I simply call the application, passing the dates as parameters, like this:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py
The parameters are then read inside the map function:
#!/usr/bin/python
import sys

date1 = sys.argv[1]
date2 = sys.argv[2]
The question is:
How do I pass the date parameters to the map calling on Amazon EMR?
I am a beginner in MapReduce and will appreciate any help.

First of all, when you run a local test (and you should, as often as possible), the correct format (in order to reproduce how map-reduce works) is:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort
That is the way the Hadoop framework works.
If you are working with a big file, you should do it in steps to verify the results of each stage, meaning:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
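If you want to sanity-check the date filtering itself, here is a minimal sketch of such a mapper (the position of the date field and the dd/mm/yyyy format are assumptions; adapt them to your access_log layout):

#!/usr/bin/python
import sys
from datetime import datetime

# The two dates arrive as positional arguments, e.g. 20/12/2014 31/12/2014
start = datetime.strptime(sys.argv[1], "%d/%m/%Y")
end = datetime.strptime(sys.argv[2], "%d/%m/%Y")

for line in sys.stdin:
    fields = line.split()
    # Assumption: the date is the first whitespace-separated field
    try:
        when = datetime.strptime(fields[0], "%d/%m/%Y")
    except (IndexError, ValueError):
        continue  # skip malformed lines
    if start <= when <= end:
        # Emit key<TAB>value pairs for the reducer to aggregate
        sys.stdout.write("%s\t1\n" % fields[0])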
In regard to your main question: it's the same thing.
If you are going to use the Amazon web console to create your cluster, in the Add Step window you just write the following:
name: learning amazon emr
Mapper: (here they ask for the S3 path to your mapper; we ignore that and just write our script name and parameters, no backslash) mapper.py 20/12/2014 31/12/2014
Reducer: (same as the mapper) reducer.py (you can add params here too)
Input location: ...
Output location: ... (just remember to use a new output location every time, or your step will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your own file paths here; even if you add only one file, use the -files argument)
That's it
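For reference, the step the console builds boils down to a hadoop-streaming invocation roughly like the one below (the location of the streaming jar varies by EMR release, and the input/output S3 paths are placeholders):

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -files s3://cod/mapper.py,s3://cod/reducer.py \
    -mapper "mapper.py 20/12/2014 31/12/2014" \
    -reducer reducer.py \
    -input s3://cod/input/ \
    -output s3://cod/output-1/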
If you want to go further with argument passing, I suggest reading this blog post on how to pass arguments so that you only need a single file for both map and reduce.
Hope this helps.

Related

Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena

I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.
On Athena, it's working fine.
But on Spark-SQL, it is not returning the value as expected.
Query is:
SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\s\((.+)\)', 1) AS cl
Output On Athena:
Federal Service of Healthcare
Output on Spark-SQL:
ia (Federal Service of Healthcare)
I am banging my head against this but can't seem to find a solution.
This does the trick:
SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\\\\s\\\\((.+)\\\\)', 1) AS cl
output:
+-----------------------------+
|cl |
+-----------------------------+
|Federal Service of Healthcare|
+-----------------------------+
The \s is not being escaped in your example, which is why it ends up as part of the group; you can also use the regexp_extract API directly, which makes for a cleaner solution:
.withColumn("cl", regexp_extract(col("name"), ".+\\s\\((.+)\\)", 1))
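For example, a minimal self-contained sketch of that DataFrame approach (the column name and the sample row are just assumptions based on your string):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("regex-example").getOrCreate()

# One-row DataFrame with the example value
df = spark.createDataFrame([("Russia (Federal Service of Healthcare)",)], ["name"])

# In Python source, ".+\\s\\((.+)\\)" is the regex .+\s\((.+)\) that Spark's Java engine sees
result = df.withColumn("cl", regexp_extract(col("name"), ".+\\s\\((.+)\\)", 1))
result.show(truncate=False)
# cl -> Federal Service of Healthcare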
Good luck!

allocate largest partition with ansible when parition name varies

I'm using ansible to configure some AWS servers. All the servers have a 1024 GB, or larger, partition. I want to allocate this partition and assign it to /data.
I have an ansible script that does this on my test machine. However, when I tried running it against all my AWS machines it failed on some of them, complaining that the /dev/nvme1n1 device doesn't exist. The problem is that some of the servers have a 20 GB root partition separate from the larger partition, and some don't. That means sometimes the partition I care about is nvme1n1 and sometimes it's nvme0n1.
I don't want to place a variable in the hosts file, since that file is being dynamically loaded from AWS anyway. Given that, what is the easiest way to look up the largest device and get its name in ansible, so I can correctly tell ansible to allocate whichever device is largest?
I assume that when you talk about "partitions" you mean "disks", as a partition will have a name like nvme0n1p1, while the disk would be called nvme0n1.
That said, I have not found an "ansible-way" to do that, so I usually parse lsblk and do some grep-magic. In your case this is what you need to run:
- name: find largest disk
  shell: |
    set -euo pipefail
    lsblk -bl -o SIZE,NAME | grep -P 'nvme\dn\d$' | sort -nr | awk '{print $2}' | head -n1
  args:
    executable: /bin/bash
  register: largest_disk

- name: print name of largest disk
  debug:
    msg: "{{ largest_disk.stdout }}"
You can then use the name of the disk in the parted module to do whatever you need with it (see the sketch after the explanation below).
Apart from that, you should add some checks before formatting your disks so you don't overwrite something: if your playbook already ran on a host, the disk might already be formatted and contain data, and in that case you do not want to format it again and overwrite the existing data.
Explanation:
lsblk -bl -o SIZE,NAME prints the size and names of all block-devices
grep -P 'nvme\dn\d$' filters for whole disks (partitions have a pX suffix at the end, remember?)
sort -nr sorts the output numerically by the first column (so you get the largest on top)
awk '{print $2}' prints only the second column (the name that is)
head -n1 returns the first line (containing the name of the largest disk)
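As a follow-up, here is a sketch of what the parted/filesystem/mount tasks could look like once you have the device name (ext4 and the single-partition layout are assumptions, and as noted above you should add safety checks before formatting anything):

- name: create a single partition spanning the disk
  parted:
    device: "/dev/{{ largest_disk.stdout }}"
    number: 1
    state: present

- name: create a filesystem on the new partition
  filesystem:
    fstype: ext4
    dev: "/dev/{{ largest_disk.stdout }}p1"

- name: mount the partition under /data
  mount:
    path: /data
    src: "/dev/{{ largest_disk.stdout }}p1"
    fstype: ext4
    state: mounted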

How to process dataflow two batch files simultaneously on GCP

I want to process two files from GCP in Dataflow at the same time.
I think it would be possible if the second file came in as a side input.
However, in that case, I think it would be processed every time, not just once.
e.g. how do I read and process file1 and file2 at the same time? (Do I have to merge the two files into one and just point to that path?)
I'd appreciate it if you could give me a good example or advice.
Thank you.
If you know the 2 files from the beginning, you can simply have a pipeline with 2 entries (FileIO).
I don't know which language you are using, but by design you can do this:
PCollection1                PCollection2
      |                           |
FileIO(readFile1)           FileIO(readFile2)
      |                           |
Transform file              Transform file
      |                           |
WriteIO(sink)               WriteIO(sink)
You can imagine side inputs, Flatten, GroupBy, ...; it all depends on your needs.
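For instance, in the Python SDK the two-branch layout above could look roughly like this (bucket paths and the upper-casing transform are placeholders):

import apache_beam as beam

with beam.Pipeline() as p:
    file1 = p | "ReadFile1" >> beam.io.ReadFromText("gs://my-bucket/file1.txt")
    file2 = p | "ReadFile2" >> beam.io.ReadFromText("gs://my-bucket/file2.txt")

    out1 = file1 | "TransformFile1" >> beam.Map(lambda line: line.upper())
    out2 = file2 | "TransformFile2" >> beam.Map(lambda line: line.upper())

    out1 | "WriteFile1" >> beam.io.WriteToText("gs://my-bucket/output/file1")
    out2 | "WriteFile2" >> beam.io.WriteToText("gs://my-bucket/output/file2")

If you wanted a single output instead, beam.Flatten() could merge the two branches before the write.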

sd_journal_send to send binary data. How can I retrieve the data using journalctl?

I'm looking at systemd-journal as a method of collecting logs from external processes. I'm very interested in its ability to collect binary data when necessary.
I'm simply testing and investigating journal right now. I'm well aware there are other, probably better, solutions.
I'm logging binary data like so:
// strData is a string container containing binary data
strData += '\0';
sd_journal_send(
    "MESSAGE=test_msg",
    "MESSAGE_ID=12345",
    "BINARY=%s", strData.c_str(),
    NULL);
The log line shows up when using the journalctl tool. I can find the log line like this from the terminal:
journalctl MESSAGE_ID=12345
I can get the binary data of all logs in journal like so from the terminal:
journalctl --field=BINARY
I need to get the binary data into a file so that I can access it from a program and decode it. How can I do this?
This does not work:
journalctl --field=BINARY MESSAGE_ID=12345
I get this error:
"Extraneous arguments starting with 'MESSAGE_ID=1234567890987654321"
Any suggestions? The documentation on systemd-journal seems slim. Thanks in advance.
You just got the wrong option. See the docs for:
-F, --field=
Print all possible data values the specified field can take in all entries of the journal.
vs
--output-fields=
A comma separated list of the fields which should be included in the output.
You also have to specify the plain output format (-o cat) to get the raw content:
journalctl --output-fields=BINARY MESSAGE_ID=12345 -o cat
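To get that raw content into a file, redirecting stdout of the same command should be enough (the file name is just an example):

journalctl --output-fields=BINARY MESSAGE_ID=12345 -o cat > binary_payload.bin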

Write to dynamic destination to cloud storage in dataflow in Python

I was trying to read from a big file in cloud storage and shard it according to a given field.
I'm planning to Read | Map(lambda x: (x[key field], x)) | GroupByKey | Write to file with the name of the key field.
However I couldn't find a way to write dynamically to cloud storage. Is this functionality supported?
Thank you,
Yiqing
Yes, you can use the FileSystems API to create the files.
An experimental write transform was added to the Beam Python SDK in 2.14.0, beam.io.fileio.WriteToFiles:
my_pcollection | beam.io.fileio.WriteToFiles(
      path='/my/file/path',
      destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
      sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
      file_naming=beam.io.fileio.destination_prefix_naming())
which can be used to write to a different file per record.
You can skip the GroupByKey and just use destination to decide which file each record is written to. The return value of destination needs to be a value that can be grouped by.
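Applied to your case, a sketch of sharding by the key field could look like the following (the input path, the way the key is extracted, and the TextSink used here are assumptions; older SDK versions may require you to implement your own FileSink subclass):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/big_input.csv")
     | "Write" >> fileio.WriteToFiles(
           path="gs://my-bucket/sharded/",
           # Assumption: the key field is the first comma-separated column
           destination=lambda record: record.split(",")[0],
           sink=lambda dest: fileio.TextSink(),
           file_naming=fileio.destination_prefix_naming()))

This writes the records for each destination into files whose names are prefixed with that key.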
More documentation here:
https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations
And the JIRA issue here:
https://issues.apache.org/jira/browse/BEAM-2857