Is there a way to automatically download the Excel files from the Interactive Data page within EDGAR for a list of tickers, without having to manually search for each ticker on EDGAR? Or is there a way to get to the XBRL for a range of companies without, again, having to visit each page within EDGAR? I am having trouble because I cannot figure out how to generate a unique URL: the last six digits depend on the sequence of the filing for that year and the accession number.
EDGAR does not have an API. I wrote a package that serves as an interface to EDGAR and allows searching by ticker. The part that parses the search results page is below; the whole file is at https://github.com/andrewkittredge/financial_fundamentals/blob/master/financial_fundamentals/edgar.py.
import re

SEARCH_URL = ('http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&'
              'CIK={symbol}&type={filing_type}&dateb=&owner=exclude&count=100')

def _get_document_page_urls(symbol, filing_type):
    '''Get the EDGAR filing document pages for the CIK.'''
    search_url = SEARCH_URL.format(symbol=symbol, filing_type=filing_type)
    search_results_page = get_edgar_soup(url=search_url)  # helper defined elsewhere in edgar.py
    # Keep only the result rows that advertise Interactive Data (i.e. XBRL).
    xbrl_rows = [row for row in search_results_page.findAll('tr')
                 if row.find(text=re.compile('Interactive Data'))]
    for xbrl_row in xbrl_rows:
        documents_page = xbrl_row.find('a', {'id': 'documentsbutton'})['href']
        documents_url = 'http://sec.gov' + documents_page
        yield documents_url
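For a whole list of tickers, the same URL template can be expanded per symbol before anything is fetched. The helper and ticker list below are illustrative, not part of the package:

```python
# Sketch: build one EDGAR company-search URL per ticker. EDGAR's
# browse-edgar endpoint accepts a ticker symbol in the CIK parameter.
SEARCH_URL = ('http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&'
              'CIK={symbol}&type={filing_type}&dateb=&owner=exclude&count=100')

def search_urls(symbols, filing_type='10-K'):
    """Yield one EDGAR search URL per ticker symbol."""
    for symbol in symbols:
        yield SEARCH_URL.format(symbol=symbol, filing_type=filing_type)

urls = list(search_urls(['AAPL', 'MSFT']))
```

Each resulting URL can then be fed to `_get_document_page_urls`-style parsing, one ticker at a time.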
I am building a data-processing app that allows users to upload multiple Excel sheets, which are processed and returned to the user. I have created 3 models: the 1st model captures the overall upload, the 2nd model captures individual Excel workbooks, and in the 3rd model the workbooks are opened and the data for each sheet within them is captured. I need advice on 3 things:
Is my model structure efficient given that users will be uploading multiple excel sheets?
Given that users may do upload multiple times in a day, how do I retrieve the latest batch of files for processing?
How do I take user inputs against each sheet uploaded by the user (3rd model) in a single view, while also showing the user a preview of the table?
Please help with your opinions.
class UserUpload(models.Model):
    number_of_workbooks = models.IntegerField(blank=True, null=True)
    file_size = models.FloatField(blank=True, null=True)
    user = models.ForeignKey(User, on_delete=models.CASCADE, null=False, blank=False)
    multiple_sheets = models.BooleanField(blank=False, default=False)
    var_1 = models.BooleanField(blank=False, default=False)

class FileUpload(models.Model):
    file_field = models.FileField(blank=False, upload_to=user_directory_path)
    userupload = models.ForeignKey(UserUpload, related_name='user_uploads', on_delete=models.CASCADE)

class FileSheetData(models.Model):
    fileupload = models.ForeignKey(FileUpload, related_name='file_sheets', on_delete=models.CASCADE)
    sheetname = models.CharField(blank=False, max_length=256)
    var_2 = models.BooleanField(blank=False, default=False)
    var_3 = models.PositiveIntegerField(blank=False, default=0)
After doing a bit of research I was able to get answers to some of the problems above:
If you want to allow multiple file uploads, you need to create a separate model and link it to the parent model through a foreign key. So my model structure for multiple file uploads is fine.
In my example, I have 3 models linked sequentially by foreign keys. Instances of FileSheetData can be accessed by following the relationships backwards. Here is the link to the documentation - https://docs.djangoproject.com/en/3.2/topics/db/queries/#backwards-related-objects
I am still struggling with this, but my current understanding is that a ModelFormSet could be a way to solve it; I am still working on it.
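As a sketch of retrieving the latest batch and then following the relationships backwards: this assumes a `created = models.DateTimeField(auto_now_add=True)` field added to UserUpload, which the models above do not yet have, and uses the `related_name`s declared there:

```python
# Sketch only. Assumes UserUpload gains:
#     created = models.DateTimeField(auto_now_add=True)
latest = (UserUpload.objects
          .filter(user=request.user)
          .order_by('-created')
          .first())

if latest is not None:
    # Walk the foreign keys backwards via the declared related_names.
    for workbook in latest.user_uploads.all():      # FileUpload rows
        for sheet in workbook.file_sheets.all():    # FileSheetData rows
            print(workbook.file_field.name, sheet.sheetname)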
I am using Django REST framework and I am trying to export to Excel. My issue is that the process takes a long time to generate the Excel file.
The final file is about 1 MB with 20k lines, and generation takes about 8 minutes, which does not seem right.
Here is the view:
class GenerateExcelView(APIView):
    def get(self, request):
        filename = 'AllHours.xlsx'
        wb = Workbook()
        ws = wb.active
        ws.title = "Workbook"
        data = Report.objects.all()
        row_counter = 2
        for line in data:
            first_name = line.employee_id
            second_name = line.employee_name
            age = line.description
            ...
            ws['A{}'.format(row_counter)] = first_name
            ws['B{}'.format(row_counter)] = second_name
            ws['C{}'.format(row_counter)] = age
            ...
            row_counter += 1
        response = HttpResponse(save_virtual_workbook(wb), content_type='application/ms-excel')
        response["Content-Disposition"] = 'attachment; filename="' + filename + '"'
        return response
There are a few more columns... Is it possible to change the process so it is a bit faster?
EDIT: I had the wrong indentation on the loop.
It tends to help a lot with performance to use prefetch_related on the queryset.
Given a table with 100 rows, each row having a foreign key to another table (in your example, the employee), your loop would fetch the reports and then, for every one of the 100 rows, the related objects. This is due to the lazy nature of the Django ORM. As you can see, we are already at 100+ queries... not so great.
If you would use:
data = Report.objects.all().prefetch_related('employee')
It would batch the related lookups into one additional query instead of one query per row.
That should already improve the speed of your solution by quite a bit.
see more: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#prefetch-related
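Since `employee` here would be a to-one relation (a ForeignKey), `select_related` is worth knowing as the companion tool: it follows the relation inside the same query via a SQL JOIN instead of issuing a second batched query. A sketch, assuming the Report model has an `employee` ForeignKey with a `name` field (neither is shown in the question):

```python
# One query total: the employee columns are JOINed into the report query,
# so accessing line.employee inside the loop triggers no extra queries.
data = Report.objects.select_related('employee')

for line in data:
    print(line.employee.name)  # already loaded, no per-row query
```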
I have been wrestling with the same problem, and even after refactoring into raw SQL there is little improvement. The issue is the speed of openpyxl.
Their documentation suggests that using write-only mode helps, but I found it to be a small improvement at best. My benchmark on a report with 2 tabs and 18k rows on the second tab showed a 50% reduction after the query refactor to SQL plus an openpyxl refactor to use write-only mode (which is a pain if you are doing cell formatting or special rows like headers and totals).
You can check their performance page here: https://openpyxl.readthedocs.io/en/stable/performance.html
... but I wouldn't get your hopes up.
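For reference, the write-only pattern mentioned above replaces per-cell addressing with `ws.append`, one row at a time. A minimal sketch, assuming the `Report` queryset and field names from the question (the painful part, formatting and header/total rows, is omitted):

```python
from openpyxl import Workbook
from openpyxl.writer.excel import save_virtual_workbook

wb = Workbook(write_only=True)          # rows can only be appended, in order
ws = wb.create_sheet(title="Workbook")

ws.append(['Employee ID', 'Employee name', 'Description'])  # header row
for line in Report.objects.all():
    # One append per row instead of several individual cell assignments.
    ws.append([line.employee_id, line.employee_name, line.description])

payload = save_virtual_workbook(wb)     # bytes for the HttpResponse
```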
I'm currently working on a project that uses a Google Form to gather data into a spreadsheet and then organizes the data based on the answers in the form. After an employee selects a department and submits the form, I would like to filter each response by department onto its own page within the workbook.
I've listed my current formula below. As of now, it only imports the data onto the same row when it is sorted onto a new page, leaving many blank rows in between the information. Is there a way to make sure it does not skip rows during the import?
=IF(REGEXMATCH('Form Responses 1'!D2:D1101, "Department Name"), 'Form Responses 1'!B2:B1101," ")
try:
=FILTER('Form Responses 1'!B2:B1101,
REGEXMATCH('Form Responses 1'!D2:D1101, "Department Name"))
I am pretty new to Python/Django but previously programmed in C# and VB.Net. I need to create a list on my Django form that contains data elements from a PostgreSQL table. The user should be able to select multiple rows, and I would then build a query to send back to the server. I have researched this but have not found anything to point me in the right direction. I will show my code below.
forms.py
PROD_MULTI_RECORDS = [
    ('darryl.dillman', 'darryl.dillman'),
    ('richard.mcgarry', 'richard.mcgarry'),
    ('janet.delage', 'janet.delage'),
]

class rdsprodform(forms.Form):
    selectmulti = forms.CharField(
        widget=forms.SelectMultiple(choices=PROD_MULTI_RECORDS),
        required=False, label="Please select records to process")
I am hoping there is a solution for my issue. I am trying to call a URL from JavaScript with a path that looks like this:
/api/vendor_filter/?city=&zipCode=29615&dateServing=04%2F21%2F2018
I have set the filter to query by date and zip code.
The call already populates the date with the current date and the zip code from the registered device's zip code. What I am hoping to achieve is that the zip code, in this case 29615, actually pulls a range from 29601 to 29627, so that any zip code in that range is retrieved.
This would be the same as stating list(range(29601, 29627 + 1)), so that the user's zip code pulls all zip code items in the database in that range.
The zip code could be as above, or it could be 68011, in which case I would want a range of (67998, 68024). Regardless of the user's zip code in the call, the zip code sits in the middle of the range.
Change your filter to this:
class VendorFilter(django_filters.FilterSet):
    zipCode = django_filters.NumericRangeFilter(field_name='zipCode')

    class Meta:
        model = VendorListing
        fields = ['dateServing', 'city', 'zipCode']
You can use the filters from the django-filter app. Read the docs.
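If you would rather compute the window server-side instead of having the client submit a range, the arithmetic from the question (68011 → 67998 to 68024, i.e. roughly zip ± 13) can live in a small helper; the `VendorListing` range lookup in the comment is only a sketch against the question's model:

```python
def zip_window(zip_code, radius=13):
    """Inclusive (low, high) window centred on the user's zip code,
    matching the question's 68011 -> (67998, 68024) example."""
    return zip_code - radius, zip_code + radius

low, high = zip_window(68011)
# In the view, the window can then drive a range lookup, e.g.:
# VendorListing.objects.filter(zipCode__range=(low, high))
```

This keeps the client call unchanged: it still sends a single zipCode, and the server widens it into a range.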