I have to create an application that reads a live data feed from more than 200 tables simultaneously and processes this data. I want to discuss the best approach to solve this problem with optimum speed, as each table receives 20+ new records every minute. So far I can think of the following solutions:
1) I can create multiple threads, each handling some 20-odd symbols independently.
2) I can create two threads, one for reading the data and one for processing it, but the reader thread will take more time because it has to read all the tables sequentially (a rough sketch of this producer/consumer layout follows below).
My database is MySQL and I am not looking to shift to a NoSQL DB right now. I am using C++ to solve this problem. I feel that if I could get the live data feed in a single table instead of 200+ tables, my second approach would become much more appropriate and faster.
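A minimal sketch of that second approach, assuming one reader thread that polls the tables and one worker thread that drains a shared queue; the Record struct, the finite polling loop, and the commented-out read_all_tables()/process() calls are placeholders, not part of the original post:

// Producer/consumer sketch: one reader thread fills a queue, one worker drains it.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct Record {                 // placeholder for one parsed row
    std::string table;
    std::string payload;
};

std::queue<Record> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void reader() {                 // would poll all 200+ tables sequentially
    for (int pass = 0; pass < 10; ++pass) {   // finite loop just for the sketch
        std::vector<Record> batch;            // = read_all_tables();
        {
            std::lock_guard<std::mutex> lk(m);
            for (auto &r : batch) q.push(std::move(r));
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
}

void worker() {                 // processes records as they arrive
    std::unique_lock<std::mutex> lk(m);
    for (;;) {
        cv.wait(lk, [] { return !q.empty() || done; });
        while (!q.empty()) {
            Record r = std::move(q.front());
            q.pop();
            lk.unlock();
            // process(r);      // heavy work happens outside the lock
            lk.lock();
        }
        if (done) return;
    }
}

int main() {
    std::thread t1(reader), t2(worker);
    t1.join();
    t2.join();
}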
Is the use of MySQL required? If not, you might get a speed increase from a NoSQL "database". Furthermore, retrieving data from a database is always a bottleneck; with this data volume you generally want to load as much as you can into RAM and read it from there, as that is much faster.
You could write a query that retrieves only the rows newer than a certain timestamp (the timestamp of your last query's execution), load that into memory, do all the speed-critical operations there, and clean up old entries that are no longer required.
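A minimal sketch of that polling pattern using the MySQL C API; the ticks table, its symbol/price/updated_at columns, and the connection details are assumptions for illustration only, not part of the original question:

// Poll for rows newer than the last-seen timestamp and process them in memory.
#include <mysql/mysql.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

int main() {
    MYSQL *conn = mysql_init(nullptr);
    if (!mysql_real_connect(conn, "localhost", "user", "password", "feed_db",
                            0, nullptr, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    std::string last_seen = "1970-01-01 00:00:00";   // watermark of the last poll
    for (;;) {
        // In real code, prefer a prepared statement over string concatenation.
        std::string query =
            "SELECT symbol, price, updated_at FROM ticks "
            "WHERE updated_at > '" + last_seen + "' ORDER BY updated_at";
        if (mysql_query(conn, query.c_str()) != 0) {
            std::fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            break;
        }
        MYSQL_RES *res = mysql_store_result(conn);
        if (res) {
            while (MYSQL_ROW row = mysql_fetch_row(res)) {
                // Process the row entirely in memory here; row[2] is its timestamp.
                last_seen = row[2];
            }
            mysql_free_result(res);
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));  // poll interval
    }
    mysql_close(conn);
    return 0;
}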
I am building software to analyze log files from ArduPilot in C++.
The data from the files comes in the following form:
Sensor name (GYRO, Barometer, etc.).
Each sensor has several fields of data; for example, the barometer has the following fields:
Altitude, Pressure, Temperature, Offset and some more.
All the entries in the log file that record barometer data will have all these fields.
Example of a line in the log file:
BARO, 843762779, 0, -1.443359, 94956.91, 43.06, -1.074093, 843762, 0, 28.38455, 1
Here is the general idea:
List of sensors: BARO, GYRO, BAT ...
Every sensor has some fields.
Every field should have either a float array or a float vector.
This way I can feed the graph module with the address of the vector to display that field's data.
I would love some help with how to build the data structure, so that I can:
Easily add data every time I read a line with more sensor data.
Easily access an array/vector of a single field for graph display.
Any ideas?
Edit:
To clear things up:
I can have 100,000 readings per field x many fields per sensor x many sensors...
I can't make up my mind whether to use vectors on the heap or pointers to vectors on the stack.
Should I use something like unordered_map for quick access?
unordered_map<int, something>
where int is the sensor's id.
Maybe you can bundle the individual values in a struct? Something like:
#include <string>

struct Sensor {
    std::string name;
    double pressure;
    double temperature;
    // ... further fields (altitude, offset, ...)
};
and then collect all sensors in a std::vector<Sensor>?
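Building on that idea and on the unordered_map mentioned in the question's edit, here is a minimal sketch of a structure that can grow as lines are parsed and hand the graph module a single field's vector; the field names and the two helper functions are placeholders, not part of either post:

// Minimal sketch: sensor name -> field name -> growing column of readings.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One column per field, keyed by field name ("Alt", "Press", "Temp", ...).
using FieldTable = std::unordered_map<std::string, std::vector<float>>;

// One field table per sensor, keyed by sensor name ("BARO", "GYRO", ...).
using SensorLog = std::unordered_map<std::string, FieldTable>;

// Append one parsed log line: values[i] belongs to fieldNames[i].
void addReading(SensorLog &log, const std::string &sensor,
                const std::vector<std::string> &fieldNames,
                const std::vector<float> &values) {
    FieldTable &fields = log[sensor];
    for (std::size_t i = 0; i < fieldNames.size() && i < values.size(); ++i)
        fields[fieldNames[i]].push_back(values[i]);
}

// The graph module can be handed the address of a single field's column.
const std::vector<float> *fieldColumn(const SensorLog &log,
                                      const std::string &sensor,
                                      const std::string &field) {
    auto s = log.find(sensor);
    if (s == log.end()) return nullptr;
    auto f = s->second.find(field);
    return f == s->second.end() ? nullptr : &f->second;
}

With this layout the vectors live inside the maps and grow on demand, so the heap-vs-stack question largely disappears, and because unordered_map never moves its elements on insertion, the pointer returned by fieldColumn stays valid while new readings are appended (only the vector's internal buffer reallocates).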
This UPDATE takes 194 seconds for 220 million rows.
Is there a way to improve that?
#standardSQL
UPDATE dataset.people SET CBSA_CODE = '54620' where substr(zip,1,5) = '99047'
When asking for performance help, it is useful to include a screenshot of the Execution Plan from the BigQuery UI to see which stages are the most intensive, and where the time was spent. Without that, though, I suspect that this small optimization should help:
UPDATE dataset.people SET CBSA_CODE = '54620' WHERE zip LIKE '99047%'
BigQuery should be able to push this filter down to its storage system, since it's a more natural way to express a prefix match, so if you see a high "Read" time in the Execution Plan for the original query, this change might reduce it.
I'm working on a research project and have been assigned to do a bit of data scraping and write code in R that can extract the current temperature for a particular zip code from a site such as wunderground.com. This may be a bit of an abstract question, but does anyone know how to do the following:
I can extract the current temperature of a particular zip code by doing this:
temps <- readLines("http://www.wunderground.com/q/zmw:20904.1.99999")
edit(temps)
temps  # prints the source code for the website, where I can look at the line that contains the temperature
ldata <- temps[lnumber]
ldata
# then a few gsub() calls extract just the numerical
# value (57.8, for example) from that line
I have a CSV file that contains the zip code of every city in the country, imported into R and arranged in a table by zip, city and state. My challenge now is to write a method (using Java terminology here, because I'm new to R) that extracts 6-7 consecutive zip codes (after a particular one that I specify) and runs the above code, modifying the link inside the readLines() call by substituting each zip code after the zmw: segment. I don't quite know how to extract the data from the table (maybe with a for loop?) and I don't know how to use that to modify the link, which is where I'm really stuck. I have a bit of a Java background, so I understand how to approach this problem; I just don't know the R syntax. I understand this is quite an abstract question since I didn't provide a lot of code, but I just want to know the functions/syntax that will let me extract the zip codes from the table and use them to build the link inside a function rather than doing it manually.
So this is about the Weather Underground data.
You can download CSV files from individual weather stations on Wunderground; however, you need to know the weather station identifier. Here is an example URL for a weather station in Kirkland, WA (KWAKIRKL8):
http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KWAKIRKL8&day=31&month=1&year=2014&graphspan=day&format=1
Here is some R code:
library(RCurl)  # provides getURL()

url <- 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KWAKIRKL8&day=31&month=1&year=2014&graphspan=day&format=1'
s <- getURL(url)
s <- gsub("<br>\n", "", s)          # strip the <br> tags embedded in the response
wdf <- read.csv(textConnection(s))  # parse the cleaned CSV into a data frame
And here is a page with which you can manually find stations and their codes.
http://www.wunderground.com/wundermap/
Since you only need a few, you can pick them out manually.
I'm currently developing a strategy for an incremental update of our user data. We assume 100,000,000 records in our database, of which approximately 1,000,000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed store (e.g. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join the new information to the existing records?
The record size is about 200 bytes. The user data has a fixed length but should be extendable. The log events have a similar but not identical structure. The number of user records is likely to grow. Near-real-time updates are desirable, i.e. a 3-hour time gap is not acceptable; a few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the Pig JOIN fast enough? Is reading all records always a bottleneck? Can Cassandra hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are the records fixed length, with a fixed number of fields, or likely to change format over time? Are we talking 100-byte records or 100,000-byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100 million (one server will probably suffice) or will it grow 100% per year (probably multiple servers, adding new ones over time)?
How you access records for updating depends on whether you need to update them in real time or whether you can run a batch job. Will updates happen every minute, every hour, or every month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.
I have a set of training data consisting of 20 multiple-choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. 50 of these respondents were selected for a free product trial. The selection process is not known. What interesting knowledge can be mined from this information?
The following is a list of what I have come up with so far:
A study of percentages (example: the percentage of people who answered B on Q5 and got selected for the free product trial).
Conditional probabilities (example: the probability that a person gets selected for the free product trial given that they answered B on Q5).
A naive Bayes classifier (this can be used to predict whether a person will be selected, given their answers to any subset of questions).
Can you think of any other interesting analysis or data-mining activities that can be performed?
The usual suspects, like correlation, can be eliminated since the responses are not quantifiable/scorable.
Is my approach correct?
It is a kind of reverse engineering.
For each respondent, you have 20 answers and one label, which indicates whether that respondent got the product trial or not.
You want to know which of the 20 questions are critical to the trial/no-trial decision. I'd suggest you first build a decision tree model on the training data and study the tree carefully to get some insights, e.g. the splits near the top of the tree involve the most discriminative questions.
The answers can be made numeric for analysis purposes, example:
RespondentID  IsSelected  Q1AnsA  Q1AnsB  Q1AnsC  Q1AnsD  Q2AnsA ...
12345         1           0       0       1       0       0
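For illustration, a small sketch of that expansion; the toy rows, the one-letter-per-question input format, and the fixed A-D choice set are assumptions rather than data from the question:

// Expand categorical answers (A-D) into 0/1 indicator columns per question,
// producing a CSV shaped like the table above.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    struct Row { int id; int selected; std::string answers; };  // one letter per question
    const std::vector<Row> data = { {12345, 1, "CB"}, {12346, 0, "AD"} };  // toy input
    const std::string choices = "ABCD";

    // Header: RespondentID, IsSelected, Q1AnsA..Q1AnsD, Q2AnsA..Q2AnsD, ...
    std::cout << "RespondentID,IsSelected";
    for (std::size_t q = 1; q <= data[0].answers.size(); ++q)
        for (char c : choices)
            std::cout << ",Q" << q << "Ans" << c;
    std::cout << "\n";

    // One indicator column per (question, choice) pair.
    for (const Row &r : data) {
        std::cout << r.id << "," << r.selected;
        for (char a : r.answers)
            for (char c : choices)
                std::cout << "," << (a == c ? 1 : 0);
        std::cout << "\n";
    }
}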
Use association analysis to see if there are patterns in the answers.
Q3AnsC + Q8AnsB -> IsSelected
Use classification (such as logistic regression or a decision tree) to model how users are selected.
Use clustering. Are there distinct groups of respondents? In what ways are they different? Use the "elbow" or scree method to determine the number of clusters.
Do you have other info about the respondents, such as demographics? A pivot table would be useful in that case.
Is there missing data? Are there patterns in the way that people skipped questions?