We were successful in extracting the data from Twitter, but we couldn't save it on our system using Flume. Can you please explain?
You might have a problem in the channel or the sink; that may be why your data is not being stored in HDFS.
Try a sink configuration like this one:
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://yourIP:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
Also check with jps whether your NameNode and DataNode processes are running.
Notes:
I have a Django model on PostgreSQL with:
raw_data = models.TextField(_("raw_data"), default='')
It just stores some raw data, which can be 1K - 200K per row.
I have 50 million rows.
Req:
I need to decrease the size of the data in the database.
Questions:
1. How can I tell what consumes the most space across the whole database?
2. Should I compress the string before storing the data?
2.1 I saw here: Text compression in PostgreSQL that it gets compressed anyway; is that true?
2.2 I wrote some Python compression code; I am not sure whether converting the bytes to a string type can cause data loss:
import json
import sys
import zlib

def shrink_raw_data(username):
    follower_data = get_string_from_database()
    text = json.dumps(follower_data).encode('utf-8')  # outputs as bytes

    # Checking size of text
    text_size = sys.getsizeof(text)
    print("\nsize of original text", text_size)

    # Compressing text
    compressed = str(zlib.compress(text, 9))
    # store String in database

    # Checking size of text after compression
    csize = sys.getsizeof(compressed)
    print("\nsize of compressed text", csize)

    # Decompressing text
    decompressed = zlib.decompress(compressed)

    # Checking size of text after decompression
    dsize = sys.getsizeof(decompressed)
    print("\nsize of decompressed text", dsize)

    print("\nDifference of size= ", text_size - csize)
    follower_data_reload = json.loads(decompressed)
    print(follower_data_reload == follower_data)
2.3 Since my data is stored as a string in the DB, is this line "str(zlib.compress(text, 9))" VALID?
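For what it's worth, here is a sketch of the direction I am considering instead (the function names are just illustrative; it base64-encodes the compressed bytes so the stored value remains a valid string for a TextField, while a BinaryField could hold the raw compressed bytes directly):

import base64
import json
import zlib

def compress_for_text_column(follower_data):
    # str(zlib.compress(...)) only gives the repr of the bytes object;
    # base64-encoding keeps the value a plain ASCII string for a TextField.
    raw = json.dumps(follower_data).encode("utf-8")
    compressed = zlib.compress(raw, 9)
    return base64.b64encode(compressed).decode("ascii")

def decompress_from_text_column(stored):
    # Reverse of compress_for_text_column.
    compressed = base64.b64decode(stored.encode("ascii"))
    raw = zlib.decompress(compressed)
    return json.loads(raw.decode("utf-8"))

# round-trip check
data = {"followers": list(range(10))}
assert decompress_from_text_column(compress_for_text_column(data)) == data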
Hello all data pipeline experts!
Currently, I'm about to set up data ingestion from an MQTT source. All my MQTT topics carry float values, except for a few from RFID scanners that carry UUIDs, which should be read in as strings. The RFID topics have "RFID" in their topic name; specifically, they are of the format "/+/+/+/+/RFID".
I would like to convert all topics EXCEPT the RFID topics to float and store them in an InfluxDB measurement "mqtt_data". The RFID topics should be stored as strings in the measurement "mqtt_string".
Yesterday, I fiddled around a lot with processor plugins and got nothing but a headache. Today, I had a first success:
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  organization = "xy"
  bucket = "bucket"
  token = "ExJWOb5lPdoYPrJnB8cPIUgSonQ9zutjwZ6W3zDRkx1pY0m40Q_TidPrqkKeBTt2D0_jTyHopM6LmMPJLmzAfg=="

[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]
  qos = 0
  connection_timeout = "30s"
  name_override = "mqtt_data"
  ## Topics to subscribe to
  topics = [
    "+",
    "+/+",
    "+/+/+",
    "+/+/+/+",
    "+/+/+/+/+/+",
    "+/+/+/+/+/+/+",
    "+/+/+/+/+/+/+/+",
    "+/+/+/+/+/+/+/+/+",
  ]
  data_format = "value"
  data_type = "float"

[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]
  qos = 0
  connection_timeout = "30s"
  name_override = "mqtt_string"
  topics = ["+/+/+/+/RFID"]
  data_format = "value"
  data_type = "string"
As you can see, in the first mqtt_consumer I left out the pattern for topics with five hierarchy levels, so non-RFID topics at that depth would be missed. Listing every possible number of hierarchy levels isn't nice either.
My question would be:
Is there a way to formulate a regex that negates the second mqtt_consumer block, i.e. selects all topics that are not of the form "+/+/+/+/RFID"? Or is there another, completely different and more elegant approach I'm not aware of?
Although I have worked with regexes before, I got stuck at this point. Thanks for any hints!
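Just to illustrate what I mean by negating the pattern, here is a quick Python sketch of the regex I have in mind (the topic names are made up; note that the topics list in mqtt_consumer takes MQTT subscription wildcards, not regexes, so a pattern like this would only apply wherever the tooling actually accepts regular expressions):

import re

# Negative lookahead: match any topic that does NOT end in "/RFID".
not_rfid = re.compile(r"^(?!.*/RFID$).*$")

topics = [
    "plant/line1/cell2/temp",           # hypothetical float topic
    "plant/line1/cell2/station3/RFID",  # RFID topic, should be excluded
    "a/b/c/d/e/f/pressure",
]

for t in topics:
    print(t, "->", bool(not_rfid.match(t)))
# plant/line1/cell2/temp -> True
# plant/line1/cell2/station3/RFID -> False
# a/b/c/d/e/f/pressure -> True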
Sorry for the lack of clarity in my first question; I have edited it to be more specific.
Because the output of the middle layers of a neural network can be very interesting, I would like to get the output of a certain layer during inference on a microcontroller (MCU) running the TensorFlow Lite Micro C++ library.
The normal way to do this in TensorFlow is:
# The model we train
model = tf.keras.models.Sequential([
...
])
model.compile(...)
model.fit(...)
# Create an aux model that includes the layers up to the one we want
layer_output_model = tf.keras.Model(model.inputs, model.layers[theIndexYouWant].output)
When we put the model onto the MCU, we first quantize/prune the model, convert it into a C array, and flash it to the MCU, like this:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
open(tflite_mnist_model, "wb").write(tflite_model)
And inference is invoked in C++ like this:
// Initialization
const tflite::Model* model = ::tflite::GetModel(g_model);  // g_model: the flashed C array
TfLiteTensor* input = interpreter.input(0);
TfLiteTensor* output = interpreter.output(0);
// Give input, run inference, and get output
input->data.f[0] = 0.;
TfLiteStatus invoke_status = interpreter.Invoke();
float value = output->data.f[0];
If I want to extract the output of a certain middle layer during inference on the MCU, how could I do it?
The only method I can come up with now is to convert the aux model layer_output_model above into a C array and upload it to the MCU as an additional model.
converter = tf.lite.TFLiteConverter.from_keras_model(layer_output_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
aux_tflite_model = converter.convert()
Is this the right way to do it? I'm not sure whether the aux_tflite_model I converted here gives the same representation as the wanted layer output of the full model, especially after quantization using representative_dataset.
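One way I could sanity-check this on the desktop before flashing is to load the converted full model in the Python TFLite interpreter with intermediate tensors preserved and compare the middle tensor against the aux model's output (a sketch; the file name and tensor index are placeholders, and experimental_preserve_all_tensors is an experimental flag):

import numpy as np
import tensorflow as tf

# Load the full converted model and keep intermediate tensors alive after invoke().
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",  # placeholder file name
    experimental_preserve_all_tensors=True,
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
sample = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()

# Inspect tensor names to find the index of the layer output of interest.
for detail in interpreter.get_tensor_details():
    print(detail["index"], detail["name"], detail["shape"])

# middle_index = ...  # pick the index found above
# middle_output = interpreter.get_tensor(middle_index)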
Thanks.
I am trying to read a csv file that is in my S3 bucket. I would like to do some manipulations and then finally convert to a dynamic dataframe and write it back to S3.
This is what I have tried so far:
Pure Python:
Val1=""
Val2=""
cols=[]
width=[]
with open('s3://demo-ETL/read/data.csv') as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
for row in readCSV:
print(row)
if ((Val1=="" ) & (Val2=="")):
Val1=row[0]
Val2=row[0]
cols.append(row[1])
width.append(int(row[4]))
else:
continues...
Here I get an error saying it cannot find the file at all (the built-in open() does not understand s3:// paths).
Boto3:
import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
contents = data['Body'].read()
print(contents)
for row in contents:
    if ((Val1 == "") & (Val2 == "")):
        Val1 = row[0]
        Val2 = row[0]
        cols.append(row[1])
        width.append(int(row[4]))
    else:
        continues...
Here it says the index is out of range, which is strange because I have 4 comma-separated values in the CSV file. When I look at the output of print(contents), I see that it treats each character as a separate element instead of each comma-separated value.
Is there a better way to read the CSV from S3?
I ended up solving this by reading it as a pandas dataframe. I first created an object with boto3, then read the whole object into a pandas dataframe, which I then converted into a list.
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('demo-ETL')
obj = bucket.Object(key='read/data.csv')
dataFrame = pd.read_csv(obj.get()['Body'])
l = dataFrame.values.tolist()
for i in l:
    print(i)
get_object returns the Body response value, which is of type StreamingBody. Per the docs, if you're trying to go line by line, you probably want to use iter_lines.
For example:
import boto3
s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
file_lines = data['Body'].iter_lines()
print(file_lines)
This probably does more of what you want.
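Building on that, here is a sketch of how you could decode the lines and feed them to the csv module (bucket and key taken from the question):

import csv
import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')

# iter_lines() yields raw bytes; decode each line before parsing it as CSV.
decoded_lines = (line.decode('utf-8') for line in data['Body'].iter_lines())
for row in csv.reader(decoded_lines):
    print(row)  # row is now a list of the comma-separated values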
You can use Spark to read the file like this:
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://bucket-name/file-name.csv")
You can find more options here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
I am looking for a geo-IP database (similar to MaxMind GeoLite2 Country and City) that will allow me to identify the US state a user is coming from, in order to target specific content to that user.
Does anyone know how or where I could find such a database/service or solution?
Don't expect high accuracy unless you're satisfied with country/city precision. It is, after all, IP-based geolocation data, and the accuracy varies; it depends on the ISPs and on the companies that manage the databases (commercial databases may have higher accuracy). Look at an IP location info web tool (like http://geoipinfo.org/ ) and you'll see approximately where it locates you; it also provides accuracy percentages at the city and country level. It uses the ip2location database for lookups and their precision data.
This thread is old, but as of today I am using http://api.ipstack.com and it's working perfectly. They have VERY extensive help and examples on their site, but basically you make the call, parse the data, and take what you want.
First, be sure you have all the necessary using directives (System.Xml, System.Net, and so on).
Second, be sure not to test on a private network IP (192.168.x.x or 10.x.x.x) because that will always return blank/empty fields and you'll think something is coded wrong.
Third, you will need an access key from ipstack.com ... you can set up a FREE account (10,000 requests a month, I think) and get your access key to put into the string below for the API call. I filled out the form and was up and running in 10 minutes for free.
This worked for me to track visitors to any page:
string IP = "";
string strHostName = "";
string strHostInfo = "";
string strMyAccessKeyForIPStack = "THEYGIVEYOUTHISWHENYOUSETUPFREEACCOUNT";
strHostName = System.Net.Dns.GetHostName();
IPHostEntry ipEntry = System.Net.Dns.GetHostEntry(strHostName);
IPAddress[] addr = ipEntry.AddressList;
IP = addr[2].ToString();
XmlDocument doc = new XmlDocument();
string strMyIPToLocate = "http://api.ipstack.com/" + IP + "?access_key=" + strMyAccessKeyForIPStack + "&output=xml";
doc.Load(strMyIPToLocate);
XmlNodeList nodeLstCity = doc.GetElementsByTagName("city");
XmlNodeList nodeLstState = doc.GetElementsByTagName("region_name");
XmlNodeList nodeLstZIP = doc.GetElementsByTagName("zip");
XmlNodeList nodeLstLAT = doc.GetElementsByTagName("latitude");
XmlNodeList nodeLstLON = doc.GetElementsByTagName("longitude");
strHostInfo = "IP is from " + nodeLstCity[0].InnerText + ", " + nodeLstState[0].InnerText + " (" + nodeLstZIP[0].InnerText + ")";
// Then I do what you want with strHostInfo, I put it in a DB myself, but whatever.