Having trouble scraping a specific site with Scrapy properly - python-2.7

I went over the tutorial for Scrapy, and I was able to understand how to scrape the site included in the tutorial. But I'm having a little trouble with some of the more complicated sites (at least to me).
I'm attempting to scrape the rows and columns of the insider transactions from this webpage:
http://finviz.com/insidertrading.ashx
I'm using command prompt commands with Scrapy to test out whether I'm able to scrape the necessary information, so the following is what I've written in the command prompt.
scrapy shell "http://finviz.com/insidertrading.ashx"
I then used Firebug in Firefox to look at the HTML code of the page.
I'm able to get some of the information (Stock Name, Name of the Insider and Date) into a list via this code:
response.css('td a.tab-link::text').extract()
However, the rest of the info is missing.
I'm able to get some (maybe most) of the missing info (Cost, Shares, Value etc.) via this code:
response.css('td::text').extract()
I can't figure out how to cleanly get all info together in one scrape.
Thanks.
EDIT: The other option would be to collect the data iteratively, one row at a time, so I can separate it as I like. I'm brooding over this as well.

Since the data is tabular, the position of table rows and columns is predictable and stable. You can simply extract all text in the row and unpack it into variables:
for row in response.xpath("//tr[@class='insider-option-row']"):
    items = row.xpath('td/a/text() | td/text()').extract()
    ticker, owner, relationship, date, transaction, cost, shares, value, shares_total, sec_form_4 = items
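If you want to do the same thing inside a spider rather than the shell, a minimal sketch might look like the following. It reuses the row class and field order from the snippet above; treat both as assumptions to verify against the live page, since the table layout can change:

import scrapy


class InsiderSpider(scrapy.Spider):
    name = 'insider'
    start_urls = ['http://finviz.com/insidertrading.ashx']

    def parse(self, response):
        # Each transaction is one table row; take the text of every cell,
        # whether or not the cell wraps its text in a link.
        for row in response.xpath("//tr[@class='insider-option-row']"):
            items = row.xpath('td/a/text() | td/text()').extract()
            if len(items) != 10:
                continue  # skip header rows or rows with an unexpected shape
            (ticker, owner, relationship, date, transaction,
             cost, shares, value, shares_total, sec_form_4) = items
            yield {
                'ticker': ticker,
                'owner': owner,
                'relationship': relationship,
                'date': date,
                'transaction': transaction,
                'cost': cost,
                'shares': shares,
                'value': value,
                'shares_total': shares_total,
                'sec_form_4': sec_form_4,
            }

Each yielded dict is one row of the table, which you can then export with, for example, scrapy crawl insider -o insider.csv.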

Related

Group by an existing attribute present in a JSON string line in Apache Beam (Java)

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows and insert into BigQuery.
But I am unable to proceed with step 1. I see GroupByKey.create() but I'm unable to make it use the customer ID as the key.
I am implementing this using Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey, you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
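The question is about the Java SDK, so the snippet above stays in Java; purely to illustrate the "keep only the latest record per customer" step, here is a minimal sketch of the same pipeline shape in Beam's Python SDK. The file pattern and the customerId/timestamp field names are assumptions:

import json

import apache_beam as beam


def latest_record(records):
    # records is the iterable of parsed JSON dicts for one customer;
    # keep the one with the newest timestamp.
    return max(records, key=lambda r: r['timestamp'])


with beam.Pipeline() as p:
    latest = (
        p
        | 'ReadFiles' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')
        | 'ParseJson' >> beam.Map(json.loads)
        | 'KeyByCustomer' >> beam.Map(lambda rec: (rec['customerId'], rec))
        | 'GroupByCustomer' >> beam.GroupByKey()
        | 'PickLatest' >> beam.MapTuple(lambda _cust_id, recs: latest_record(recs))
    )
    # `latest` now holds one record per customer; from here you would map each
    # record to a BigQuery row and write it with beam.io.WriteToBigQuery(...)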

Kibana: can I store "Time" as a variable and run a consecutive search?

I want to automate a few searches in one; here are the steps:
Search in Kibana for this ID:"b2c729b5-6440-4829-8562-abd81991e2a0" which will return me a bunch of logs. Of these logs I need to take the first and the last timestamp:
I would now like to store these two timestamps, FROM: September 3rd 2019, 21:28:22.155 and TO: September 3rd 2019, 21:28:23.524, in 2 variables
Run a second search in Kibana for the word "fail" between these two time variables
How to automate the whole process without need of copy/paste and running a second query?
EDIT:
SHORT STORY LONG: I work in a company that produces software for autonomous vehicles.
SCENARIO: A booking is rejected and we need to understand why.
WHERE IS THE PROBLEM: I need to monitor just a few seconds of logs on 3 different machines. Each log is completely separate; there is no relation between the logs, so I cannot write a single query in Discover. I need to run 3 separate queries.
EXAMPLE:
A booking was rejected, so I open Chrome and I search on "elk-prod.myhost.com" for the BookingID:"b2c729b5-6440-4829-8562-abd81991e2a0" and I have a dozen logs returned over a range of 2 seconds (FROM: September 3rd 2019, 21:28:22.155, TO: September 3rd 2019, 21:28:23.524).
Now I need to know what was happening on the car, so I open a new Chrome tab and I search on "elk-prod.myhost.com" for the CarID: "Tesla-45-OU" on the time range FROM: September 3rd 2019, 21:28:22.155, TO: September 3rd 2019, 21:28:23.524.
Now I need to know why the server which calculates the matching rejected the booking, so I open a new Chrome tab and I search for the word CalculationMatrix, again on the time range FROM: September 3rd 2019, 21:28:22.155, TO: September 3rd 2019, 21:28:23.524.
CONCLUSION: I want to stop opening Chrome tabs by hand and automate the whole thing. I have no idea around what time the booking was made, so I first need to search for the BookingID "b2c729b5-6440-4829-8562-abd81991e2a0", then store the timestamps of the first and last log and run a second and third query based on those timestamps.
There is no relation between the 3 logs I search, so there is no way to filter from Discover; I need to automate 3 different queries.
Here is how I would do it. First of all, from what I understand, you have three different indexes:
one for "bookings"
one for "cars"
one for "matchings"
First, in Discover, I would create three Saved Searches, one per index pattern. Then in Visualize, I would create a Vertical bar chart on the bookings saved search (Bucket X-Axis by date_histogram on the timestamp field, leave the rest as is). You'll get a nice histogram of all your booking events bucketed by time.
Finally, I would create a dashboard and add the vertical bar chart + those three saved searches inside it.
When done, the way I would search according to the process you've described above is as follows:
Search for the booking ID b2c729b5-6440-4829-8562-abd81991e2a0 in the top filter bar. In the bar chart histogram (bookings), you will see all documents related to the selected booking. On that chart, you can select the exact period from when the very first booking document happened to the very last. This will adapt the main time picker at the top, and the start/end time will be "remembered" by Kibana.
Remove the booking ID from the top filter (since we now know the time range and Kibana stores it). Search for Tesla-45-OU in the top filter bar. The bar histogram + the booking saved search + the matchings saved search will be empty, but you'll have data inside the second list, the one for cars. Find whatever you need to find in there and go to the next step.
Remove the car ID from the top filter and search for CalculationMatrix. Now the third saved search is going to show you whatever documents you need to see within that time range.
I'm lacking realistic data to try this out, but I definitely think this is possible as I've laid out above, probably with some adaptations.
Kibana does work like this (any order is ok):
Select time filter: https://www.elastic.co/guide/en/kibana/current/set-time-filter.html
Add additional search criteria, for example field s is b2c729b5-6440-4829-8562-abd81991e2a0.
Add additional search criteria, for example field x is Fail.
Additionally, you can view surrounding documents: https://www.elastic.co/guide/en/kibana/current/document-context.html#document-context
This is how Kibana works.
You can prepare some filters beforehand, save them, and then use them if you want to automate the discovery process somehow.
You can do that in Discover tab in Kibana using New/Save/Open options.
Edit:
I do not think you can achieve what you need in Kibana. As I mentioned earlier, one option is to change the data that is coming into Elasticsearch so you can search for it via Discover in Kibana. Another option could be building, for example, a Java application that uses Elasticsearch - then you can write an algorithm that returns the data that you want. But I think it's a big overhead and I recommend checking the data first.
Edit: To clarify - you can create an external Java application, let's say a Spring Boot application, that uses Elasticsearch - all the data that you need is inside it.
But with this option you will not use Kibana at all.
You can export the result to CSV or whatever you want in the code.
A Spring Boot application can ask Elasticsearch for whatever it needs; it would then be easy to store these time variables inside the Java code.
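To make the "ask Elasticsearch directly" idea concrete, here is a rough sketch of the automation outside Kibana, shown in Python rather than Spring Boot only for brevity. The host, index patterns and field names (BookingID, CarID, @timestamp) are assumptions to adapt to your own mappings:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elk-prod.myhost.com:9200")  # hypothetical host/port

# 1) Find the time window covered by the booking's logs.
booking = es.search(
    index="bookings-*",  # assumed index pattern
    body={
        "size": 0,
        "query": {"match": {"BookingID": "b2c729b5-6440-4829-8562-abd81991e2a0"}},
        "aggs": {
            "first": {"min": {"field": "@timestamp"}},
            "last": {"max": {"field": "@timestamp"}},
        },
    },
)
start = booking["aggregations"]["first"]["value_as_string"]
end = booking["aggregations"]["last"]["value_as_string"]

# 2) Reuse that window for the second (and third) search.
car_logs = es.search(
    index="cars-*",  # assumed index pattern
    body={
        "query": {
            "bool": {
                "must": [{"match": {"CarID": "Tesla-45-OU"}}],
                "filter": [{"range": {"@timestamp": {"gte": start, "lte": end}}}],
            }
        }
    },
)

The same start/end pair can then be reused for the CalculationMatrix query, so the three searches run in one script instead of three Chrome tabs.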
EDIT: After the OP edited the question to change it dramatically:
@FrancescoMantovani Well, the edited version is very different from what you first posted here: "How to automate the whole process without need of copy/paste and running a second query?" and searching for the word fail in a single shot. In the accepted answer you are still using three filters one at a time, so it is not one search, but three.
What's more, if you used one index and sent data from multiple hosts via Filebeat, you wouldn't even have to create this dashboard to do that. You could select the exact period from when the very first document for a filter happened to the very last, then remove that filter and add the next one you need - it's as simple as that. Before, you were writing about one query,
How to automate the whole process without need of copy/paste and
running a second query?
not three. And you don't need to open a new tab in Chrome each time you want to change a filter; just organize the data, for example by using Filebeat as mentioned before.
There is no relation between the 3 logs
From what you wrote, the relation exists and it is time.
If the data is in, for example, three different indices (because the documents don't have much similar data), you can do it like this:
You can change them easily in Discover.
You can go to Discover, select index 1, search, and select the time range that you need. When you change the index, the time range is still the one you selected; you only need to change the filter and you will get what you need.

Mapping user spreadsheet columns to database fields

I’m not sure where to start on this project. I know how to read the contents of the Excel spreadsheet, I know how to identify the header row, and I know how to loop over the contents. I believe I have the UX portion worked out, but I am not sure how to process the data.
I’ve googled and only found .NET solutions, but I’m looking for a ColdFusion/Lucee solution.
I have a working form allowing me to map a user's spreadsheet columns to my database values (this is being kept simple for this post; the user does not have direct access to the database).
Now that I have my data, I'm not sure how to loop over the data results. I believe there will be several loops (an outer and an inner). Then of course I also need to loop over the file contents, but I think if I can get the headings mapped out, I can figure out the rest.
Any good links, tutorials, or guides would be greatly appreciated.
Some pseudo code might be enough to get me started.
User uploads form
System reads headers and content.
User is presented a form with a list of columns from their uploaded spreadsheet to match with available database fields (e.g. “column1” matches “customer name”).
User submits form.
Now what?
UPDATED
Here is what the data looks like AFTER the mapping has been done in my form. The column delimiter is the ::: and within the column the ||| indicates the ID associated with the selected column value. I've included the ID and the column value since I plan on displaying the mapping again as a confirmation. Having the ID saves a trip to the database.
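The splitting itself is language-agnostic; here is a rough sketch of the idea in Python (the sample string below is made up, since the actual submitted data wasn't included), which should be straightforward to translate to CFML list or string functions:

# Hypothetical submitted mapping string: one ::: separated entry per
# spreadsheet column, each holding "<db field id>|||<db field label>".
submitted = "12|||customer name:::7|||customer email:::9|||customer phone"

mappings = []
for column_index, entry in enumerate(submitted.split(":::")):
    field_id, field_label = entry.split("|||", 1)
    mappings.append({
        "spreadsheet_column": column_index,  # position of the user's column
        "db_field_id": field_id,             # the ID kept to avoid a DB round trip
        "db_field_label": field_label,       # shown back on the confirmation page
    })

# Outer loop: the spreadsheet's data rows. Inner loop: the mappings above,
# reading each row's cell by spreadsheet_column and writing it to db_field_id.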
If I understand correctly, your question is: how do you provide the user a form allowing them to map their spreadsheet columns to those of the database?
Since you have their spreadsheet column names, and you have the database column names, then this problem is essentially a UI/UX problem. You need to show both lists, and allow the user to map them. I can imagine several approaches to this. My first thought would be some sort of drag/drop operation, as follows:
Create a list of boxes, one for each field in your database table, and include the field name in (or above) the box. I'll call this the db field list. Then, create another list for each column from the spreadsheet, which I'll call the spreadsheet column list. The user would drag/drop items from the spreadsheet column list to the db field list.
When a mapping has been completed by the user, you would store the column/field names as data on the DOM element of the db field list box. Then upon submission, you would acquire the mapping data by visiting each box and adding it to an array. Then you would serialize that array into JSON and send that to your form submission handler.
This could be difficult or easy, depending on your knowledge of UI implementations using JavaScript. jQuery makes this easy (if you know jQuery). There's even a jQuery UI plugin that does this: https://jqueryui.com/droppable/.
A quick search for javascript drag drop would help, and here's a few articles I found:
https://www.w3schools.com/html/html5_draganddrop.asp
https://medium.com/quick-code/simple-javascript-drag-drop-d044d8c5bed5
You would also need to submit the array of mappings using javascript. You could search for that as well, and here's an article I found:
https://codereview.stackexchange.com/questions/94493/submit-an-array-as-an-html-form-value-using-javascript

In Enterprise Guide, how do you re-open a previous data step's output to view it?

I'm using Enterprise Guide 4.3.
When you run a data step, the resulting output opens in a spreadsheet-like table.
Then when you run a proc tabulate or similar, the spreadsheet-like view of the data disappears and the table comes up in SAS Report or HTML form etc.
You can then run further commands on that dataset that was created in the data step.
Q. How can you get that spreadsheet-like view of the dataset back? (assuming it's possible)
I know you can run the data step again and it will display it, but that seems really inefficient, especially if the data step had lots of computations involved. The data is obviously 'sitting there' given you can still interact with it (with proc tabulate etc). I was really surprised to see that it drops off from the process flow view.
Apologies if I've named things poorly above; I'm an R user beginning to dabble in SAS.
If I understood you correctly, you run some code and the result comes up. Then you run some other piece of code from the same Code node, and the initial result gets removed from the process flow.
You can always find your dataset in the Server List. You can enable it by clicking View -> Server List.
There is also a trick that you can do. When you run your code and the dataset node is created in the process flow, you can do a simple query on it. Just do Right click -> Filter and query and make it do something simple that won't take too long.
Now, when you run your next piece of code, this node will not be replaced (at least this is what happens in EG 4.1).
If you mean viewing the resulting data set from a DATA STEP, choose View/Process Flow and double click on the data set you want to view. Also, within your program, log, data or result view, there should be tabs across the top that allow you to bring up the other items of the process flow.

Webservice for autosuggest on city names / postal codes including long-lat coordinates?

I'm looking for a webservice to be used for an autocomplete field,
where people can fill in either a postal code, a city name, or both.
This service will need all cities in Europe, so we can use it for all our country websites.
In a later stage we want to keep the door open for Asia and America, so that would be a plus.
Preferably it would also return the long-lat coordinates for the locations.
Right now it is a free text field; after leaving the field, we hit the Google geocoding service
to find coordinates... preferably I would tie these two together,
so we don't have to query 2 services for one thing.
Does anyone know of the existence of such a service online somewhere?
Or would you suggest building our own database with cities / postal codes / coordinates?
If so, we would need to get the content from somewhere too, and I was trying to avoid that issue :)
I recently searched for a similar service, in vain.
I wanted my users to have auto-complete on entering a city name, and once a city is chosen I needed to pass the name and lat/long onto the Google API. In the end I did this:
Downloaded the geonames allCountries.zip, the full extract
Imported it into a SQL DB via SSIS (about 7.5 million records!)
Wrote a simple query to extract just the cities (only the PPLC, PPLA and PPLA2 records).
This left me with a manageable table of 9112 records (with lat / long and country code) which covers all the cities in the world. I then wrote my own code to query the data.
Not ideal, but I needed a solution.
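If you'd rather avoid SSIS and a SQL import, the same filtering step can be done with a small script. Here is a rough Python sketch, assuming the tab-separated allCountries.txt layout documented in the geonames readme (name, latitude, longitude, feature code and country code columns); double-check the column positions against the readme before relying on them:

import csv

# 0-based column positions in allCountries.txt as documented by geonames:
# 1 = name, 4 = latitude, 5 = longitude, 7 = feature code, 8 = country code.
NAME, LAT, LON, FEATURE_CODE, COUNTRY = 1, 4, 5, 7, 8
CITY_CODES = {"PPLC", "PPLA", "PPLA2"}  # the feature codes used above

cities = []
with open("allCountries.txt", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row[FEATURE_CODE] in CITY_CODES:
            cities.append({
                "name": row[NAME],
                "lat": float(row[LAT]),
                "lng": float(row[LON]),
                "country": row[COUNTRY],
            })

print(len(cities), "cities kept")

From there the filtered list can be loaded into whatever table or JSON file backs the autocomplete field.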
I know this post is very old, but for those who are looking for a simple solution that can be integrated in 5 minutes, here is the link:
Geocomplete jQuery...
For my case I followed these steps:
1 - Download the plugin from here.
2 - Add the jquery.geocomplete.js or jquery.geocomplete.min.js file to your project's JavaScript folder.
3 - Call this file in script tags on the HTML page where you have the input field that you want to autocomplete with cities:
<script src='/PathToTheFile/jquery.geocomplete.js'></script>
4 - To convert an input into an autocomplete field, simply call the Geocomplete plugin in script tags:
<script>
$("#IdOfTheInputField").geocomplete(); // Option 1: Call on element.
$.fn.geocomplete("input"); // Option 2: Pass element as argument.
</script>
5 - You can check the complete list of options at the link provided at the top.
Hope that this helped!