Dealing with a list of tuples - SQLite3 & Python 2.7 - python-2.7

I am using a database to return a couple of values I placed there. Let's just say the data is google, yahoo, bing.
The Code
dbCursor.execute('''SELECT ticker FROM SearchEngines''')
allEngines = dbCursor.fetchall()
for engine in allEngines:
    print engine
Yields the following result:
(u'google',)
(u'yahoo',)
(u'bing',)
This is troublesome because I require the result to be appended to a url in a string format. Does anybody know a way around this?
Thanks

fetchall() returns a list of tuples: each row is a tuple, even if you're only selecting one field. So...
for engine in allEngines:
    print engine[0]
Or:
for (engine,) in allEngines:
    print engine
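If the end goal is to append each value to a URL, a quick sketch of that (the base URL here is just a placeholder):
base_url = 'http://www.example.com/search?q='  # placeholder URL
for (engine,) in allEngines:
    # engine is a plain unicode string now, so normal concatenation works
    print base_url + engine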
Hope this helps.

Related

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them are helping me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs that has been stripped down to just numbers, in a single column, and I still get this error. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
from sagemaker import RandomCutForest

# execution_role, bucket, prefix and taxi_data are defined in earlier cells of the notebook
rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was using the sample notebook Julien Simon mentioned above, but at some point the data was ending up as strings! The catch with the RCF algorithm is that it has to run on numeric data, not strings.
What I did was make sure to cast the array to an int array as a double check, and voila! It worked. I am at a loss as to how the data ended up as strings, but that was the issue. Simple solution.
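Roughly, the cast looked like this (the file and variable names are changed, and rcf is the estimator from the snippet above):
import pandas as pd

# one-column CSV of the stripped-down IP numbers (placeholder file name)
ip_data = pd.read_csv('ips.csv', header=None)

# force a numeric dtype so nothing reaches RCF as strings
train_array = ip_data.values.astype(int).reshape(-1, 1)

rcf.fit(rcf.record_set(train_array))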

Get highest value from a list with a lot of useless characters

I am trying to get a value from a cell in Google Sheets which contains a list of values separated with commas.
Example:
UC133 - 2019/01/10 2019/01/30, UC99 - 2018/11/29 2018/12/19, UC134 -
2019/06/01 2019/06/19, UC132 - 2018/12/20 2019/01/09
I would like to be able to get an output in a cell of "UC134", because 134 is "bigger" than UC99, UC132 and UC133.
I tried a lot of different functions and formulas but I was unable to get anything to work. I also tried to fix the original data I get this from, but that does not seem to be an option.
Any help is appreciated, if possible without any custom script functions.
Thank you very much for your time and let me know if you have any questions.
=ARRAYFORMULA("UC"&MAX(REGEXEXTRACT(SPLIT(A1, ","), "UC(\d+)\s")*1))
shorter: =ARRAYFORMULA("UC"&MAX(LEFT(SPLIT(A1, "UC"), 3)*1))
longer: =ARRAYFORMULA("UC"&MAX(INDEX(SPLIT(TRANSPOSE(SPLIT(A1, "UC")), " ")),,1))
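Just to illustrate the logic those formulas implement (pull out every UC number and keep the largest), here is the same idea as a small Python sketch, not a Sheets solution:
import re

cell = ('UC133 - 2019/01/10 2019/01/30, UC99 - 2018/11/29 2018/12/19, '
        'UC134 - 2019/06/01 2019/06/19, UC132 - 2018/12/20 2019/01/09')

numbers = [int(n) for n in re.findall(r'UC(\d+)', cell)]
print 'UC%d' % max(numbers)  # prints UC134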

Kibana search with regular expression not working

I am trying to find some logs in Kibana by using regular expressions. I am aware that Kibana doesn't support "classical" regex, but rather Lucene query syntax. I have read through its documentation (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and in my opinion my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms; categorization pushed:+0ms/1981ms; categorization started:+131ms/2112ms; categorization completed:+123ms/2235ms; in total:2235ms.
What I want to find in the end is all log entries where the time of "categorization started" exceeds a certain threshold. However, my queries already fail while I am just trying to approach that final query.
I get results when I query:
message:"/categorization started/"
But as soon as I modify it to:
message:/categorization started/
I get nothing. The following attempts also return nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the final query that should get what I want should be as follows (finding all entries where categorization started time was 10,000ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
This of course also returns nothing, which doesn't surprise me when the queries above already fail.
Can anyone explain to me what I am doing wrong?
Thank you
I suggest using a wildcard query instead:
message:*categorization started*
Lucene regexp queries are term-level: they are matched against the individual tokens in the index, so on an analyzed text field like message a pattern containing a space can never match a single term. The quoted form you tried works because it is parsed as a phrase query, not as a regular expression.
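If you really need a regular expression (for example to apply the 10,000 ms threshold), one option is to run it through the query DSL against an unanalyzed keyword sub-field, since that stores the whole message as a single term. A rough sketch with the elasticsearch-py client; the host, index pattern and message.keyword sub-field are assumptions about your setup, and note that dynamically mapped keyword sub-fields skip values longer than 256 characters by default:
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # placeholder host

# regexp queries match whole terms, so they need an unanalyzed (keyword) field
query = {
    "query": {
        "regexp": {
            "message.keyword": ".*categorization started:\\+<10000-99999>ms.*"
        }
    }
}

result = es.search(index="my-logs-*", body=query)  # placeholder index pattern
print result['hits']['total']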

Searching a Mongo database using PyMongo, while using regex

I currently have a PyMongo collection with around 100,000 documents. I need to perform a regex search on each of these documents, checking each document against around 1,800 values to see if a particular field (which is an array) contains one of the 1,800 strings. After testing a variety of ways of using regex, such as compiling into a regular expression, multiprocessing and multi-threading, the performance is still abysmal, and takes around 30-45 minutes.
The current regex I'm using to find the value at the end of the string is:
rgx = re.compile(string_To_Be_Compared + '$')
And then this is run using a standard pymongo find query:
coll.find( { 'field' : rgx } )
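Put together, the full loop looks roughly like this (simplified; the list and collection names are placeholders):
import re

matches = []
for value in strings_to_match:  # the ~1,800 values
    rgx = re.compile(value + '$')  # anchored to the end of the string
    matches.extend(coll.find({'field': rgx}))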
I was wondering if anyone had any suggestions for querying these values in a more optimal way? Ideally, returning all the matching values should take less than 5 minutes. Would the best course of action be to use something like ElasticSearch, or am I missing something basic?
Thanks for your time

How to delete a url in each string from a dataset

I have a dataset in which one column has the tweets and the other column has labels for the tweets. My problem is that I want the http links present in the tweets to be removed. For example:
RT #AmDiabetesAssn: Know what’s scary? These #diabetes statistics. Spread awareness this November for #DiabetesMonth! http://t.co/qIiiSc4ozZ
Given the tweet above, I want to remove (http://t.co/qIiiSc4ozZ) and get the output below, for all the strings:
RT #AmDiabetesAssn: Know what’s scary? These #diabetes statistics. Spread awareness this November for #DiabetesMonth!
I have seen many examples and tried those but couldn't get the desired result. Please help. Thanks in advance.
I tried this, which should work for any links that don't have spaces in them:
import re

for tweet in tweets:
    # strip http/https links up to the next whitespace character
    print re.sub(r'https?://\S+\s?', '', tweet)
I assume here that you've got a bunch of strings in the tweets array that represent the first column that you described above (also that you want them printed). You should be able to modify to suit the iteration pattern you're using.
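If the tweets are actually in a pandas DataFrame column rather than a plain list, the same substitution can be applied row by row (the file and column names below are placeholders):
import re
import pandas as pd

df = pd.read_csv('tweets.csv')  # placeholder file name
df['tweet'] = df['tweet'].apply(lambda t: re.sub(r'https?://\S+\s?', '', t))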