I have huge data (TBs) of DNS logs in text files where each record is of the form
timestamp | resolvername | domainlookedfor | dns_answer
where,
timestamp - time at which the record was logged
resolvername - the dns resolver that served the end-user
domainlookedfor - domain that was looked for by the end user
dns_answer - final dns resolution record of 'hostname -> ip address'
As of now, I have individual text files for every five minutes of logs from various DNS resolvers. So if I want to see the records from the past 10 days which contain a given hostname, say www.google.com, I have to scan the entire data set for the past 10 days (let's say 50GB) and then filter only the records that match the domain (let's say 10MB of data). So obviously a huge chunk of data is read from disk unnecessarily and it takes a lot of time to get the results.
To improve this situation, I am thinking of partitioning the data based on the domain name and thereby reducing my search space. Also, I would like to retain the notion of records separated by time (if not a file for every 5 minutes, then at least one file per, say, day).
One simple approach that I can think of is,
Bucket the records based on the hash of the domain name (or maybe the first two letters) [domain_AC, domain_AF, domain_AI ... domain_ZZ], where directory domain_AC will have the records for all domains whose 1st character is A and 2nd character is A, B or C.
Within each bucket, there will be a separate file for each day [20130129, 20130130, ... ]
So to obtain records for www.google.com, first identify the bucket and then, based on the date range, scan the respective files and filter only the records that match www.google.com.
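For illustration, a rough sketch of that bucketing scheme in Python (the base path, the 'www.' stripping, the catch-all bucket and the timestamp format are assumptions, not part of the original design):

import os

BUCKET_WIDTH = 3  # domain_AC covers second letters A-C, domain_AF covers D-F, ...

def bucket_for(domain: str) -> str:
    """Map a domain to its bucket directory, e.g. 'www.google.com' -> 'domain_GO'."""
    name = domain[4:] if domain.startswith("www.") else domain  # assumption: bucket on the registered name
    first, second = name[0].upper(), name[1].upper()
    if not second.isalpha():
        return f"domain_{first}_other"  # hypothetical catch-all for digits/hyphens
    group_end = ord("A") + (ord(second) - ord("A")) // BUCKET_WIDTH * BUCKET_WIDTH + BUCKET_WIDTH - 1
    return f"domain_{first}{chr(min(group_end, ord('Z')))}"

def record_path(domain: str, timestamp: str, base_dir: str = "/data/dns") -> str:
    """Return the daily file a record would be appended to, assuming timestamps like '20130129T101500'."""
    return os.path.join(base_dir, bucket_for(domain), timestamp[:8])

# record_path("www.google.com", "20130129T101500") -> "/data/dns/domain_GO/20130129"

With paths laid out like this, answering the www.google.com query for the last 10 days would only mean reading the ten daily files under domain_GO.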
Another requirement I have is to group the records based on the resolvername, to answer queries such as: get all the records served by resolver 'x'.
Please let me know if there are any important details that I should consider and any other known approach to solve this problem. I appreciate any help. Thanks!
Related
What I have seen so far is that the AWS Glue crawler creates the table based on the latest changes in the S3 files.
Let's say the crawler creates a table and then I upload a CSV with updated values in one column. When the crawler runs again, it updates the table's column with the updated values. I want to eventually be able to show a comparison of the old and new data in QuickSight; is this scenario possible?
For example,
right now my CSV file holds the details of one AWS service; e.g. RDS is the CSV file name and the columns are account ID, account name, region, etc.
There was one percentage column with a value of 50%, and it gets updated to 70%. Would I somehow be able to get the old value as well, to show in QuickSight something like "previously it was 50% and now it's 70%"?
Maybe this scenario is not even valid? Because I want to be able to show which account has what cost in month X and how the cost differs in other months. If I made a separate table on each update of the CSV, there would be 1000+ tables at some point.
If I have understood your question correctly, you are aiming to track data over time. Above you suggest creating a table for each time series; why not instead maintain a record in a single table for each time series? You can then create various analyses over the data, comparing specific months or tracking month-by-month values.
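A minimal sketch of that idea, assuming you can stamp each CSV with a snapshot_date column before it lands in S3 (the file paths and column name here are hypothetical), so every crawler run adds new rows for the same series instead of overwriting them:

import csv
from datetime import date

def add_snapshot_column(src_path: str, dst_path: str) -> None:
    """Copy a CSV, appending a snapshot_date column to every row."""
    today = date.today().isoformat()
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader) + ["snapshot_date"])  # extend the header
        for row in reader:
            writer.writerow(row + [today])

# add_snapshot_column("rds.csv", "rds_stamped.csv"), then upload the stamped file to a
# dated prefix such as s3://my-bucket/rds/snapshot_date=2023-01-01/ so Glue, Athena and
# QuickSight can filter or compare by snapshot_date (old and new values side by side).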
I have a bucket that stores files based on a transaction time into a filename structure like
gs://my-bucket/YYYY/MM/DD/[autogeneratedID].parquet
Let's assume this structure dates back to 2015/01/01.
Some of the files might arrive late, so in theory a new file could be written to the 2020/07/27 structure tomorrow.
I now want to create a BigQuery table that inserts all files with transaction date 2019-07-01 and newer.
My current strategy is to slice the past into small enough chunks to just run batch loads, e.g. by month. Then I want to create a transfer service that listens for all new files coming in.
I cannot just point it to gs://my-bucket/* as this would try to load the data prior to 2019-07-01.
So basically I was thinking about encoding the "future looking file name structures" into a suitable regex, but it seems like the wildcard names https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames only allow for very limited syntax, which is not as flexible as, say, awk regexes.
I know there are streaming inserts into BQ, but I am still hoping to avoid that extra complexity and just make a smart configuration of the transfer config following the batch loads.
You can use scheduled queries with an external table. When you query the external table, you can use the pseudo column _FILE_NAME in the WHERE clause of your request and filter on it.
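A minimal sketch of that approach, assuming a hypothetical external table my_dataset.transactions_ext defined over gs://my-bucket/*; because the directories are zero-padded YYYY/MM/DD, a plain lexicographic comparison on _FILE_NAME is enough and no regex is needed:

from google.cloud import bigquery

client = bigquery.Client()

# Only rows coming from files under date directories >= 2019/07/01 are returned.
# Saved as a scheduled query with a destination table, this keeps picking up
# late-arriving files on every run without touching the older history.
query = """
SELECT *
FROM `my_project.my_dataset.transactions_ext`
WHERE _FILE_NAME >= 'gs://my-bucket/2019/07/01/'
"""
rows = client.query(query).result()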
I've got a static website for which I need to implement search over a separate data set; I'm currently hosting the site using serverless tech on AWS, including S3, CloudFront, Lambda and API Gateway for some server-side logic.
I've got several csv files with about 120,000 records in them with a structure like this:
ID | search_name | name | source | quantity
10002 | Lorem Ipsum | Dolor sit amet | primary_name | 10
10002 | Lorem Ipsum | Consectetur amet | other_name | 10
10002 | Lorem Ipsum | Donec a erat | other_name | 10
10003 | Ultricies pretium | Inceptos | primary_name | 100
10003 | Ultricies pretium | Himenaeos | other_name | 100
So the end result will be a search form on my front end that makes an API call to a backend system, which queries a database or separate software service able to string-match the 'search_name' field and then returns all the matches. My front end would display the records with 'source' and 'other_name' as metadata in the result rather than as separate results.
A new set of CSV files will be provided every 3 months, which will contain the same records plus additional ones, but the 'quantity' field may have new values.
Since I've been working with serverless tech, my initial thought was to try storing the files in an S3 bucket, use AWS Glue to process them and make them available to AWS Athena for querying. I quite like this setup since there aren't a lot of components to maintain and the hosting costs will be low. My two concerns with this setup are, firstly, the time I'll spend trying to engineer a nice search algorithm that can sort results according to how close a match they are; e.g. if a search name is ABC, it should be the first result, as opposed to other items that just have ABC as part of their name.
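(As an aside, one crude way such ordering could be approximated in plain Athena SQL, reusing the hypothetical table and search term from the query below, is sketched here:)

# Sketch only: ranks exact matches above prefix matches above substring matches,
# with shorter names winning ties. In a real handler the search term should be
# bound as a parameter rather than spliced into the query string.
RANKED_QUERY = """
SELECT id, search_name, source
FROM data
WHERE lower(search_name) LIKE '%lorem%'
ORDER BY
  CASE
    WHEN lower(search_name) = 'lorem' THEN 0
    WHEN lower(search_name) LIKE 'lorem%' THEN 1
    ELSE 2
  END,
  length(search_name)
"""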
Secondly, execution speed: I've run some simple queries like this:
SELECT id, search_name, source
FROM data
WHERE search_name like '%lorem%';
Just using the query editor in the Athena GUI, the execution time can range from 0.5 to 3 seconds. It's those 3-second executions that concern me, and I'm wondering how well this can be optimized. I've also read "Users can only submit one query at a time and can only run up to five simultaneous queries for each account."; unless there's some caveat to my understanding of that, it sounds like it kind of kills it for me.
As a second option I was thinking of using AWS ElasticSearch. I don't know a whole lot about it, but I figured that using a system engineered to perform search may make my final product much better. I don't know a lot about implementing it, but my concerns here are, again, my ability to prioritise certain search results, and how easy it's going to be to perform the data ingestion process, e.g. when a new set of data arrives it needs to update the existing records rather than just stack on top of them. I wrote an initial script to load the CSV records in there to test a query.
I've just started looking at AWS CloudSearch now, which is actually looking a bit simpler than ElasticSearch, so I'm starting to lean that way.
So the advice I'm looking for is a recommendation on what products or services I should use, be it Athena, ElasticSearch or something else, and any top level advice on how I should be implementing those services.
Thank you.
The point you should be most concerned with is: who is going to use your application? If it is only yourself, I would have no problem with a few Athena queries and the slow response time. If your application is public-facing, however, think seriously about the traffic and the amount of money you are going to pay for Athena to scan your dataset over and over.
Quick breakdown
Athena: Quickly get an overview of your CSV data right where it sits (S3). No complex ETL/ingestion or indexing needed. It is not particularly strong at searching.
CloudSearch: Check whether it is still maintained/updated; I have a feeling it no longer is. Use it at your own risk!
ElasticSearch: Strong at searching, especially natural-language-based searches. You can customize ranking weights and the like, which may fit your requirement (see the sketch below).
I would recommend going for self-hosted ElasticSearch.
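To illustrate the ranking point, a minimal sketch with the Python Elasticsearch client (7.x-style calls; the endpoint, index name and the search_name.keyword sub-field mapping are assumptions): exact matches get the biggest boost, prefix matches the next, and fuzzy matches catch the rest.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

query = {
    "query": {
        "bool": {
            "should": [
                # exact name first ...
                {"term": {"search_name.keyword": {"value": "abc", "boost": 10}}},
                # ... then names starting with the term ...
                {"match_phrase_prefix": {"search_name": {"query": "abc", "boost": 5}}},
                # ... then anything that loosely matches
                {"match": {"search_name": {"query": "abc", "fuzziness": "AUTO"}}},
            ]
        }
    }
}
results = es.search(index="products", body=query)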
I decided to invest my time in ElasticSearch over CloudSearch after reading this article; the author implemented a similar system with CloudSearch and stated they would have used ElasticSearch instead if they were starting over.
Athena wasn't a good fit for search given the work I'd need to do to try and optimize it, nor were its restrictions good for a public-facing website.
I couldn't avoid scripting the import process for my data, so I ended up writing scripts to fetch the files from an S3 bucket, validate them, denormalize the data, and send it to ElasticSearch under a new index. These scripts will end up in Lambda functions, which will facilitate a fully automated process to update the data set.
One of the nice things I found ElasticSearch has is the ability to alias an index. Since the CSVs I receive periodically are the complete source of truth for my data, I can automate the import into a new unique index name based on a timestamp. Once the import is complete, a scripted API request shifts the alias from the old index to the new one and then deletes the old index. So I can replace the entire dataset in ElasticSearch and then set it live without downtime or mixed data sets. Before I found out about aliases I thought I would have to perform updates to the existing index, or create a new one and then update the website to refer to the URL of the new search index.
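For reference, the alias swap described above can be scripted in a few lines with the Python Elasticsearch client (7.x-style calls; the index prefix and alias name are hypothetical):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
ALIAS = "products-live"

# 1. Import the new CSV drop into a fresh, timestamped index.
new_index = f"products-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
es.indices.create(index=new_index)
# ... bulk-load the validated, denormalized records into new_index here ...

# 2. Atomically move the alias from the old index (if any) to the new one.
old_indices = list(es.indices.get_alias(name=ALIAS).keys()) if es.indices.exists_alias(name=ALIAS) else []
actions = [{"remove": {"index": i, "alias": ALIAS}} for i in old_indices]
actions.append({"add": {"index": new_index, "alias": ALIAS}})
es.indices.update_aliases(body={"actions": actions})

# 3. Drop the superseded indices so only one full data set is kept.
for i in old_indices:
    es.indices.delete(index=i)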
Pretty new to DynamoDB and the whole of AWS; it's very exciting but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a mobile React Native app which stores one row into a DynamoDB table each time a user performs a search (the table is a search history with a UUID and then the search criteria). On average we get a few thousand new searches in the table every day. The table has just a primary key, which is the search ID.
The app is quite new but we are already reaching a few hundred thousand rows in the table and can expect a million in the following months. The data is plain and simple, with a unique ID and strings and numbers in the other attributes. No connections, no relationships, etc. That's when I started to feel that maybe DynamoDB was not the best choice, but still, I read everywhere that it can be suitable for anything if properly managed.
Next to this there is a web-app dashboard which, thanks to a REST API using Node.js Lambdas, queries DynamoDB to compute statistics about the searches: how many searches per day, a list of the latest searches... The problem is that DynamoDB is not really suited to querying hundreds of thousands of items (the 1 MB limit, query limitations, capacity credits...).
When I do a scan I get only 3000 searches. I tried looping over the scan using the last evaluated key, but after a few tests I did not get data back and I maxed out the provisioned throughput. It seems really clear that I don't have the right approach to bring all these searches to my web app. So now, what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using the AWS migration service?). Will it really be easier then?
Creating Lambdas that execute scheduled jobs every night to compute the statistics for each day, so that I don't have to query the full table all the time but only some of the most recent searches plus the pre-computed statistics rows (see the sketch after this list)? Is it doable? Any Node.js / Lambda tutorial you may know of regarding this?
Better management of indexes? I am still very lost regarding those.
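A rough sketch of the second idea, assuming boto3, a hypothetical SearchHistory table with a search_date attribute holding the day of each search, and a hypothetical DailySearchStats table for the results; run nightly from a scheduled Lambda, it pages through the table once and stores one statistics row per day instead of rescanning everything on every dashboard load (an incremental version would only scan the most recent day):

from collections import Counter
import boto3

dynamodb = boto3.resource("dynamodb")
history = dynamodb.Table("SearchHistory")   # hypothetical source table
stats = dynamodb.Table("DailySearchStats")  # hypothetical pre-aggregated table

def handler(event, context):
    """Nightly job: count searches per day and persist the totals."""
    counts = Counter()
    kwargs = {"ProjectionExpression": "search_date"}
    while True:  # follow LastEvaluatedKey until the scan is exhausted
        page = history.scan(**kwargs)
        for item in page["Items"]:
            counts[item["search_date"]] += 1
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    for day, total in counts.items():
        stats.put_item(Item={"day": day, "total": total})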
Looking forward to your opinions.
Add another layer to take care of full-text search.
For example, Elasticsearch, Algolia or other similar services.
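For illustration, a minimal sketch of the DynamoDB Streams to Elasticsearch integration mentioned in the reference below (the endpoint, index name and 'id' key attribute are assumptions, and the client calls are 7.x-style):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://my-search-endpoint:9200")  # hypothetical endpoint

def handler(event, context):
    """Lambda triggered by the table's DynamoDB Stream: mirror each search record into Elasticsearch."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        # Crude flattening of DynamoDB's typed attributes, e.g. {"S": "foo"} -> "foo".
        doc = {key: list(value.values())[0] for key, value in image.items()}
        es.index(index="searches", id=doc["id"], body=doc)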
Notes:
Elasticsearch may cost you a lot compared with DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
I am trying to explore a use case like this: we have huge data (50B records) in files, each file has around 50M records, and each record has a unique identifier. It is possible that a record present in file 10 is also present in file 100, but the latest state of that record is in file 100. The files sit in AWS S3.
Now let's say around 1B of the 50B records need reprocessing. Once reprocessing is completed, we need to identify all the files that ever held these 1B records and replace the content of those files for these 1B unique IDs.
Challenges: right now, we don't have a mapping that tells which file contains which unique IDs. And the whole file replacement needs to complete in one day, which means we need parallel execution.
We have already initiated a task for maintaining the file-to-unique-IDs mapping; we would need to load this data while processing the 1B records, look them up in this data set, and identify all the distinct file dates for which content replacement is required.
The mapping will be huge, because it has to hold 50B records, and it may keep increasing as the system grows.
Any thoughts around this?
You will likely need to write a custom script that will ETL all your files.
Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)
If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).
Rather than creating a system to look up unique IDs in files, you should have your system output a timestamp. This way, the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the file.
I would suggest:
Process each existing file (yes, I know you have a lot!) and add a column that represents the filename
Once you have a new set of input files with the filename column (which acts to identify the latest record), use Amazon Athena to read all records and output one row per unique ID (with the latest date). This would be a normal SELECT ... GROUP BY statement with a little playing around to get only the latest record (see the sketch after this list).
Athena would output new files to Amazon S3, which will contain the data with unique records. These would then be the source records for any future processing you perform.
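A sketch of that step as an Athena CTAS statement run through boto3 (the table, column and bucket names are hypothetical, the helper rn column is kept in the output for brevity, and the added source_file column is assumed to sort chronologically):

import boto3

athena = boto3.client("athena")

# Keep only the newest copy of each unique ID, where "newest" is decided by the
# added source_file column, and write the result back to S3 as Parquet.
QUERY = """
CREATE TABLE deduplicated_records
WITH (external_location = 's3://my-bucket/deduplicated/', format = 'PARQUET') AS
SELECT *
FROM (
  SELECT *, row_number() OVER (PARTITION BY id ORDER BY source_file DESC) AS rn
  FROM raw_records
)
WHERE rn = 1
"""
athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)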