Hello all data pipeline experts!
Currently, I'm about to set up data ingestion from an MQTT source. All my MQTT topics contain float values, except for a few from RFID scanners that contain UUIDs, which should be read in as strings. The RFID topics have "RFID" in their topic name; specifically, they are of the format "/+/+/+/+/RFID".
I would like to convert all topics EXCEPT the RFID topics to float and store them in the InfluxDB measurement "mqtt_data". The RFID topics should be stored as strings in the measurement "mqtt_string".
Yesterday, I fiddled around a lot with Processors and got no results other than headache. Today, I had a first success:
[[outputs.influxdb_v2]]
urls = ["http://localhost:8086"]
organization = "xy"
bucket = "bucket"
token = "ExJWOb5lPdoYPrJnB8cPIUgSonQ9zutjwZ6W3zDRkx1pY0m40Q_TidPrqkKeBTt2D0_jTyHopM6LmMPJLmzAfg=="
[[inputs.mqtt_consumer]]
servers = ["tcp://127.0.0.1:1883"]
qos = 0
connection_timeout = "30s"
name_override = "mqtt_data"
## Topics to subscribe to
topics = [
"+",
"+/+",
"+/+/+",
"+/+/+/+",
"+/+/+/+/+/+",
"+/+/+/+/+/+/+",
"+/+/+/+/+/+/+/+",
"+/+/+/+/+/+/+/+/+",
]
data_format = "value"
data_type = "float"
[[inputs.mqtt_consumer]]
servers = ["tcp://127.0.0.1:1883"]
qos = 0
connection_timeout = "30s"
name_override = "mqtt_string"
topics = ["+/+/+/+/RFID"]
data_format = "value"
data_type = "string"
As you can see, in the first mqtt_consumer I left out all topics with 5 levels of hierarchy, so it misses every such topic, not just the RFID ones. Listing every possible number of hierarchy levels isn't nice either.
My question would be:
Is there a way to formulate a regex that negates the second mqtt_consumer block, i.e. selects all topics that are not of the form "+/+/+/+/RFID"? Or is there a completely different, more elegant approach I'm not aware of?
Although I have worked with regexes before, I got stuck at this point. Thanks for any hints!
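To make the intended rule concrete, here is a minimal Python sketch of "everything except +/+/+/+/RFID goes to mqtt_data". It is not Telegraf configuration; it only illustrates the routing logic as it might look in custom code that subscribes to "#" and decides per message. The function names topic_matches and measurement_for are made up for this illustration, and the matcher ignores the special handling of topics starting with "$".

```python
RFID_FILTER = "+/+/+/+/RFID"  # same filter as in the second mqtt_consumer block


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches an MQTT filter using '+' and '#' wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard matches the rest of the topic
        if i >= len(topic_levels):
            return False  # filter has more levels than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level does not match
    # Without '#', filter and topic must have the same number of levels
    return len(filter_levels) == len(topic_levels)


def measurement_for(topic: str) -> str:
    """Pick the measurement name: RFID topics are strings, everything else floats."""
    return "mqtt_string" if topic_matches(RFID_FILTER, topic) else "mqtt_data"


if __name__ == "__main__":
    for t in ["hall/line1/cell2/scanner/RFID", "hall/line1/cell2/motor/temp", "hall/temp"]:
        print(t, "->", measurement_for(t))
```

With this rule, a 5-level topic that does not end in RFID still lands in mqtt_data, which is exactly the case the topic list above misses.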
I am facing a similar problem: using boto3 the query does not work, while it works in the console.
First I tried this scan without success:
text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression = filter_expr, Limit = 1000)
Then I noticed that the scan works for a text value that does not contain ":".
So I tried a second scan using ExpressionAttributeNames and ExpressionAttributeValues:
table.scan(
FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
ExpressionAttributeValues = {
':v0': '2020-04-01',
':v1': '2020-04-27',
':v2': {"S": text}},
Limit = 1000
)
Failed again.
Finally, if I change the first example to:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":".
Is there another way to search for text containing the ":" character?
[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check that you have an up-to-date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text because you're using a resource-level scan, so you don't need to supply the low-level attribute type (it's only required for a client-level scan).
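The difference between the two value styles can be sketched as follows. This is only an illustration that builds the keyword arguments without calling AWS; the helper names resource_scan_kwargs and client_scan_kwargs and the table name "tweets" are made up, and it assumes both attributes are stored as strings, as in the answer above.

```python
def resource_scan_kwargs(text, start, end, limit=1000):
    """Arguments for a resource-level scan: values are plain Python types."""
    return {
        "FilterExpression": "#n0 BETWEEN :v0 AND :v1 AND #n1 = :v2",
        "ExpressionAttributeNames": {"#n0": "timestamp", "#n1": "text"},
        "ExpressionAttributeValues": {":v0": start, ":v1": end, ":v2": text},
        "Limit": limit,
    }


def client_scan_kwargs(table_name, text, start, end, limit=1000):
    """Arguments for a client-level scan: every value carries its type ('S' = string)."""
    kwargs = resource_scan_kwargs(text, start, end, limit)
    kwargs["TableName"] = table_name
    kwargs["ExpressionAttributeValues"] = {
        placeholder: {"S": value}
        for placeholder, value in kwargs["ExpressionAttributeValues"].items()
    }
    return kwargs


if __name__ == "__main__":
    # Hypothetical usage:
    #   table.scan(**resource_scan_kwargs(...))   # boto3.resource("dynamodb").Table(...)
    #   client.scan(**client_scan_kwargs(...))    # boto3.client("dynamodb")
    print(resource_scan_kwargs("city:barcelona", "2020-04-01", "2020-04-27"))
    print(client_scan_kwargs("tweets", "city:barcelona", "2020-04-01", "2020-04-27"))
```

Mixing the two styles, as in the question's ':v2': {"S": text} on a resource-level scan, compares the attribute against a map instead of a string.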
We were successful in extracting the data from Twitter, but we couldn't save it on our system using Flume. Can you please explain?
You might have a problem in the channel or the sink; that may be why your data is not being stored in HDFS.
Try to understand this configuration:
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://yourIP:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
And check with jps that your DataNode and NameNode are running.
I have a set of documents, each containing a server name along with the start timestamp and end timestamp of that server, e.g.:
[
{
serverName: "Houston",
startTimestamp: "2018/03/07 17:52:13 +000",
endTimestamp: "2018/03/07 18:50:10 +000"
},
{
serverName: "Canberra",
startTimestamp: "2018/03/07 18:48:09 +000",
endTimestamp: "2018/03/07 20:10:00 +000"
},
{
serverName: "Melbourne",
startTimestamp: "2018/03/08 01:43:13 +000",
endTimestamp: "2018/03/08 12:09:10 +000"
}
]
With this data, given a timestamp, I need to get the list of servers active at that point in time.
For example, for TS="2018/03/07 18:50:00 +000" the list of active servers from the above data is ["Houston", "Canberra"].
Is it possible to achieve this using only CouchDB views? If so, how should I go about it?
Note: Initially I tried the following approach. In the map function I emit two rows per document:
1 with key=doc.startTimestamp and value={"station_add": doc.station}
1 with key=doc.endTimestamp and value={"station_rem": doc.station}
My intention was to iterate through these in the reduce function, adding the stations present in "station_add" and removing the stations in "station_rem". But I found that the CouchDB documentation does not mention anything about the ordering of values in the reduce function.
If you can live with fixed periods and don't mind the extra disk space that might be needed for the view results, you can create a view of active servers per hour, for example.
Iterate over the periods between start and end and emit the time that each server was online during this period:
function(doc) {
  var start = new Date(doc.startTimestamp).getTime()
  var end = new Date(doc.endTimestamp).getTime()
  var msPerPeriod = 60*60*1000
  var msOfflineInFirstPeriod = start % msPerPeriod
  var firstPeriod = start - msOfflineInFirstPeriod
  var msOnlineInLastPeriod = end % msPerPeriod
  var lastPeriod = end - msOnlineInLastPeriod
  if (firstPeriod === lastPeriod) {
    // The server was only online within one period.
    emit([new Date(firstPeriod), doc.serverName], [1, msOnlineInLastPeriod - msOfflineInFirstPeriod])
  } else {
    // The server was online over multiple periods.
    emit([new Date(firstPeriod), doc.serverName], [1, msPerPeriod - msOfflineInFirstPeriod])
    for (var period = firstPeriod + msPerPeriod; period < lastPeriod; period += msPerPeriod) {
      emit([new Date(period), doc.serverName], [1, msPerPeriod])
    }
    emit([new Date(lastPeriod), doc.serverName], [1, msOnlineInLastPeriod])
  }
}
If you want the total without the server names, just add a reduce function with the built-in shortcut _sum. You'll get the number of servers online during the period as the first number and the milliseconds that the servers were online in that period as the second number.
You can play with the view if you emit the year, month and day as the first keys. Then you can use the group_level at query time to get a finer or more coarse overview.
Bear in mind that this view might get large on disk, as each row has to be stored, and also the intermediate results for each group level are stored. So you shouldn't set the period duration too small – emitting a row for each second would take a lot of disk space, for example.
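The per-period arithmetic of the map function can be checked outside CouchDB with a small Python sketch. It mirrors the same logic on plain millisecond values; the function name rows_per_period is made up for this illustration, and times are given relative to an hour boundary to keep the numbers readable.

```python
MS_PER_HOUR = 60 * 60 * 1000


def rows_per_period(start_ms, end_ms, ms_per_period=MS_PER_HOUR):
    """Return (period_start_ms, ms_online) pairs, one per period the server was online."""
    ms_offline_in_first = start_ms % ms_per_period
    first_period = start_ms - ms_offline_in_first
    ms_online_in_last = end_ms % ms_per_period
    last_period = end_ms - ms_online_in_last

    if first_period == last_period:
        # Online within a single period only
        return [(first_period, ms_online_in_last - ms_offline_in_first)]

    rows = [(first_period, ms_per_period - ms_offline_in_first)]
    # Periods strictly between the first and the last are fully covered
    for period in range(first_period + ms_per_period, last_period, ms_per_period):
        rows.append((period, ms_per_period))
    rows.append((last_period, ms_online_in_last))
    return rows


if __name__ == "__main__":
    # Like "Houston": online from 00:52:13 to 01:50:10 relative to an hour boundary
    start = (52 * 60 + 13) * 1000
    end = MS_PER_HOUR + (50 * 60 + 10) * 1000
    for period_start, ms_online in rows_per_period(start, end):
        print(period_start, ms_online)
```

As in the map function, an end time that falls exactly on a period boundary produces a final row with 0 ms online.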
Over the past week I have been collecting the past 7 days' tweets related to "lung cancer". Yesterday I figured I needed to collect more fields, so I added some and started re-collecting the same period of tweets from last week. The problem is that the first time I collected ~2000 tweets related to lung cancer for 18 Sept 2014, but last night the same query returned only ~300 tweets. When I looked at the timestamps in this new set, it only contains tweets from about 23:29 to 23:59 on 18 Sept 2014. A large chunk of data is obviously missing. I don't think it's caused by my code (below): I have tested various variations, including deleting most of the fields to be collected, and the data is still cut off prematurely.
Is this a known issue with Twitter API (when collecting last 7 days' data)? If so, it will be pretty horrible if someone is trying to do serious research. Or is it somewhere in my code that caused this (note: it runs perfectly fine for other previous/subsequent dates)?
import tweepy
import time
import csv
ckey = ""
csecret = ""
atoken = ""
asecret = ""
OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)
# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search
# Common parameters: Changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'
############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'
# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before next request (because twitter request wait time is 15min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
                           since=startSince, until=endUntil).items(999999999):  # changeable here
    try:
        '''
        print "Name:", tweet.author.name.encode('utf8')
        print "Screen-name:", tweet.author.screen_name.encode('utf8')
        print "Tweet created:", tweet.created_at'''
        placeHolder = []
        placeHolder.append(tweet.author.name.encode('utf8'))
        placeHolder.append(tweet.author.screen_name.encode('utf8'))
        placeHolder.append(tweet.created_at)
        prefix = 'TweetData_lungCancer'
        wholeFileName = prefix + suffix
        with open(wholeFileName, "ab") as f:  # changeable here
            writeFile = csv.writer(f)
            writeFile.writerow(placeHolder)
        counter2 += 1
        if counter2 == 4000:
            time.sleep(60*20)  # wait for 20 min everytime 4,000 tweets are extracted
            counter2 = 0
            continue
    except tweepy.TweepError:
        time.sleep(60*20)
        continue
    except IOError:
        time.sleep(60*2.5)
        continue
    except StopIteration:
        break
Update:
I have since tried running the same Python script on a different computer (faster and more powerful than my home laptop), and it returned the expected number of tweets. I don't know why this happens, since my home laptop works fine for many other programs, but I think we can rest the case and rule out issues with the script or the Twitter API.
If you want to collect more data, I would highly recommend the streaming API that Tweepy has to offer. It has a much higher rate limit; in fact, I was able to collect 500,000 tweets in just one day.
Also, your rate limit checking is not very robust: you don't know for sure that Twitter will allow you to access 4000 tweets. From experience, I found that the more often you hit the rate limit, the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit. Alternatively, you should check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also, your end date is incorrect. The end date should be '2014-09-21', one day later than whatever today's date is.
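The off-by-one on the end date comes from the search API's until parameter being exclusive: it returns tweets created before the given date. A minimal sketch of computing it from the last day you want included, using only the standard library (the helper name exclusive_until is made up here):

```python
from datetime import date, timedelta


def exclusive_until(last_day_wanted: str) -> str:
    """Return the `until` value that still includes all of `last_day_wanted` (YYYY-MM-DD)."""
    day = date.fromisoformat(last_day_wanted)
    return (day + timedelta(days=1)).isoformat()


if __name__ == "__main__":
    # To include tweets from 2014-09-20, pass the following day as `until`
    print(exclusive_until("2014-09-20"))
```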
I am looking for a GEO IP database (Similar to MaxMind GeoLite2 Country and City) that will allow me to identify the US state that the user is coming from in order to target specific content to that user.
Does anyone know how or where I could find such a database/service or solution?
Don't expect high accuracy, unless you're satisfied with country/city precision. It is, after all, IP-based geolocation data, and the accuracy varies. It depends on the ISPs or the companies that manage the databases (commercial databases may have higher accuracy). Look at an IP location info web tool (like http://geoipinfo.org/) and you'll see approximately where it finds you; it also provides accuracy at city and country levels as a percentage. It uses the ip2location database for lookups and their precision data.
This thread is old, but as of today, I used http://api.ipstack.com and it's working perfectly. They have VERY extensive help and examples on their site, but basically you make the call, parse the data and get what you want.
First, be sure you have any/all includes (Namespace=System.Xml, System.Net, blah, blah).
Second, be sure not to test on a private network IP (192.168.x.x or 10.x.x.x) because that will always return blank/empty fields and you'll think something is coded wrong.
Third, you will need an access key from ipstack.com. You can set up a FREE account (10,000 requests a month, I think) and get your access key to insert into the API call string below. I filled out the form and was up and running in 10 minutes for free.
This worked for me to track visitors to any page:
string IP = "";
string strHostName = "";
string strHostInfo = "";
string strMyAccessKeyForIPStack = "THEYGIVEYOUTHISWHENYOUSETUPFREEACCOUNT";
strHostName = System.Net.Dns.GetHostName();
IPHostEntry ipEntry = System.Net.Dns.GetHostEntry(strHostName);
IPAddress[] addr = ipEntry.AddressList;
IP = addr[2].ToString();
XmlDocument doc = new XmlDocument();
string strMyIPToLocate = "http://api.ipstack.com/" + IP + "?access_key=" + strMyAccessKeyForIPStack + "&output=xml";
doc.Load(strMyIPToLocate);
XmlNodeList nodeLstCity = doc.GetElementsByTagName("city");
XmlNodeList nodeLstState = doc.GetElementsByTagName("region_name");
XmlNodeList nodeLstZIP = doc.GetElementsByTagName("zip");
XmlNodeList nodeLstLAT = doc.GetElementsByTagName("latitude");
XmlNodeList nodeLstLON = doc.GetElementsByTagName("longitude");
strHostInfo = "IP is from " + nodeLstCity[0].InnerText + ", " + nodeLstState[0].InnerText + " (" + nodeLstZIP[0].InnerText + ")";
// Then do what you want with strHostInfo. I put it in a DB myself, but whatever.
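For comparison, the same flow can be sketched in Python with only the standard library: skip private addresses, build the request URL, and pull the state out of the XML. The URL shape and the tag names (city, region_name, zip) are taken from the snippet above; the helper names, the sample XML and its root element are made up for illustration and are not an actual ipstack response.

```python
import ipaddress
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical response body, only for demonstrating the parsing step
SAMPLE_XML = "<result><city>Austin</city><region_name>Texas</region_name><zip>78701</zip></result>"


def build_lookup_url(ip: str, access_key: str):
    """Return the lookup URL, or None for private/loopback IPs that cannot be located."""
    addr = ipaddress.ip_address(ip)
    if addr.is_private or addr.is_loopback:
        return None
    query = urlencode({"access_key": access_key, "output": "xml"})
    return f"http://api.ipstack.com/{ip}?{query}"


def parse_location(xml_text: str) -> dict:
    """Extract city, state and ZIP from an XML body, whatever the root element is called."""
    root = ET.fromstring(xml_text)

    def first(tag):
        node = next(root.iter(tag), None)
        return node.text if node is not None else None

    return {"city": first("city"), "state": first("region_name"), "zip": first("zip")}


if __name__ == "__main__":
    print(build_lookup_url("192.168.1.5", "YOUR_ACCESS_KEY"))  # private address, no lookup
    print(build_lookup_url("8.8.8.8", "YOUR_ACCESS_KEY"))
    print(parse_location(SAMPLE_XML))
```

In a web application the address to look up would normally be the visitor's remote address from the request rather than the server's own host entry.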