I'm trying to open a WARC file with Python using the library documented at the following link:
http://warc.readthedocs.org/en/latest/
When opening the file with:
import warc
f = warc.open("00.warc.gz")
Everything is fine and the f object is:
<warc.warc.WARCFile instance at 0x1151d34d0>
However, when I try to read everything in the file using:
for record in f:
print record['WARC-Target-URI'], record['Content-Length']
The following error appears:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
record = self.read_record()
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
header = self.read_header(fileobj)
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'
Is this because my warc file version is not supported by the warc toolbox I'm using or something else?
The ClueWeb09 dataset is distributed in WARC 0.18 format, but it has several issues: some records are malformed.
The most prevalent problem is an extra newline in the WARC header, and there are a few cases of other malformed headers as well.
Moreover, it does not use the standard \r\n end-of-line markers, which is the actual cause of your error (note the bare \n in the 'WARC/0.18\n' shown in your traceback).
The warc-clueweb library can handle this. It is a Python library written specifically for ClueWeb09 WARC files. According to its documentation:
Only minor modifications to the original library were made. The original documentation of the warc library still holds
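To confirm that the line endings are the culprit, you can inspect the first line of the archive directly. The sketch below is not part of the warc library; version_line_ending is a hypothetical helper, written for Python 3, that only reports how the version line is terminated.

```python
import gzip


def version_line_ending(path):
    """Return 'CRLF', 'LF', or 'none' for the first line of a gzipped WARC file."""
    with gzip.open(path, "rb") as fh:
        first_line = fh.readline()
    if first_line.endswith(b"\r\n"):
        return "CRLF"
    if first_line.endswith(b"\n"):
        return "LF"
    return "none"
```

For the file in the question, version_line_ending("00.warc.gz") would be expected to report LF, since the error message shows the version line as 'WARC/0.18\n'.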
Yes, thanks to @eyelash for the explanation of this problem.
Some records in ClueWeb09 are indeed malformed. However, both the official warc library and the warc-clueweb fork recommended above have some issues.
The fork cannot handle the ClueWeb12 dataset, and it can also miss 1-2 documents in each .warc.gz file it processes.
So I changed the code slightly to support both the ClueWeb09 and ClueWeb12 datasets. Here is my repo, which has been tested on 100 billion pages; my warc tools are forked from the warc-clueweb library and the official repo.
I am currently using Python 2.7 and my OS is Windows 7. While attempting to use the Bloomberg API I am getting this error:
Traceback (most recent call last):
File "datagrab.py", line 1, in <module>
import blpapi, time, json
File "C:\Python27\lib\blpapi\__init__.py", line 5, in <module>
from .internals import CorrelationId
File "C:\Python27\lib\blpapi\internals.py", line 50, in <module>
_internals = swig_import_helper()
File "C:\Python27\lib\blpapi\internals.py", line 42, in swig_import_helper
import _internals
ImportError: No module named _internals
I have set my PATH variable to point to blpapi3_64.dll and also updated my Bloomberg Terminal. I have also moved the local blpapi directory to a different location, but the problem persists.
I am kind of new to this API in general. So can someone please guide me?
Thank you in advance!
From your question it sounds like you may have tried this already, but here is one possible solution from the README in the Python Supported Release available here.
Note that many Python installations add the current directory to the
module search path. If the Python interpreter is invoked from the
installer directory, such a configuration will attempt to use the
(incomplete) local blpapi directory as a module. If the above
import line fails with the message Import Error: No module named
_internals, move to a different directory before invoking python.
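One way to see whether the local directory is shadowing the installed package is to ask Python where it would load the module from. This is a minimal sketch for Python 3.4 or later, not something taken from the README; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin
```

If module_origin("blpapi") returns a path inside the installer directory rather than the installed package location, the incomplete local copy is the one being picked up.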
I know this question is a bit stale, but in case people end up here like I did: do you have the C++ version of blpapi? It is a requirement for the Python API, as mentioned here: https://www.bloomberg.com/professional/support/api-library/
Download the C++ zip installer, extract it somewhere, and then set an environment variable so that the Python API can find it:
Environment variable name: BLPAPI_ROOT
Value: C:\blp\blpapi_cpp_3.8.18.1 (THIS IS WHERE MINE IS INSTALLED, YOUR VALUE HERE MAY BE DIFFERENT)
Hope that helps!
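A quick sanity check for the variable could look like the sketch below. It only verifies that BLPAPI_ROOT is set and points to an existing directory; blpapi_root_status is a hypothetical helper, and it does not verify the contents of the C++ SDK.

```python
import os


def blpapi_root_status(environ=None):
    """Describe whether BLPAPI_ROOT is set and points to an existing directory."""
    environ = os.environ if environ is None else environ
    root = environ.get("BLPAPI_ROOT")
    if not root:
        return "BLPAPI_ROOT is not set"
    if not os.path.isdir(root):
        return "BLPAPI_ROOT points to a missing directory: " + root
    return "BLPAPI_ROOT looks usable: " + root
```

Calling print(blpapi_root_status()) in a freshly opened console shows whether the new variable is visible to Python; consoles opened before the variable was set usually need to be restarted.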
I already had Sublime Text installed, and while setting up new plug-ins I ran into a problem with SublimeClang. It requires libclang, but I searched all my folders using the "locate clang" command and couldn't find libclang anywhere (the docs on the web say it should be located in /usr/lib/x86_64-linux-gnu/).
When I open Sublime Text, the console prints this:
Traceback (most recent call last):
File "/home/meng/.config/sublime-text-3/Packages/SublimeClang/internals/clang/cindex.py", line 95, in get_cindex_library
return cdll.LoadLibrary(filename)
File "./ctypes/__init__.py", line 431, in LoadLibrary
File "./ctypes/__init__.py", line 353, in __init__
OSError: libclang.so: cannot open shared object file: No such file or directory
error: It looks like libclang.so couldn't be loaded. You have to compile it yourself, or download from https://github.com/quarnster/SublimeClang/downloads.
Please note that this plugin uses features from clang 3.0 so make sure that is the version you have installed.
Once you have the file, you need to copy libclang.so into the root of this plugin. See http://github.com/quarnster/SublimeClang for more details.
I'm a rookie on Ubuntu and hope someone can help me. Many thanks!
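If locate turns up nothing, a direct search of the library directories can confirm whether any libclang shared object is present before deciding to build or download one. This is only a sketch in Python 3; find_libclang is a hypothetical helper and the directories passed to it are assumptions about a typical Ubuntu layout.

```python
import glob
import os


def find_libclang(search_dirs):
    """Return sorted paths of files named libclang*.so* under the given directories."""
    matches = []
    for directory in search_dirs:
        # "**" with recursive=True also matches files directly inside `directory`.
        pattern = os.path.join(directory, "**", "libclang*.so*")
        matches.extend(glob.glob(pattern, recursive=True))
    return sorted(matches)
```

For example, find_libclang(["/usr/lib", "/usr/local/lib"]) lists candidate files; per the error message above, the one that matches the expected clang version would then be copied into the root of the plugin as libclang.so.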
When I try to use the Google Speech Rec API I get this error message. Any help?
dyld: Library not loaded: /usr/local/Cellar/flac/1.3.1/lib/libFLAC.8.dylib
Referenced from: /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/speech_recognition/flac-mac
Reason: image not found
I'm using PyCharm.
I have tried copy-pasting, uninstalling, and reinstalling, but to no avail. HELP :) My whole project is to get the user to say something, have Google Translate translate it, and have it say the answer. I have the translating and speaking covered, but the speech recognition is what I am having trouble with now. Thanks in advance.
Here are more error messages.
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/speech_recognition/__main__.py", line 12, in <module>
audio = r.listen(source)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/speech_recognition/__init__.py", line 264, in listen
buffer = source.stream.read(source.CHUNK)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyaudio.py", line 605, in read
return pa.read_stream(self._stream, num_frames)
IOError: [Errno Input overflowed] -9981
First, you must have Homebrew installed.
Second, once Homebrew is installed, install flac:
brew install flac
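To check from Python whether both commands are actually reachable, something like the following sketch can help. missing_tools is a hypothetical helper for Python 3; it only checks that the executables are on PATH, not that the FLAC library the error mentions can be loaded.

```python
import shutil


def missing_tools(names=("brew", "flac")):
    """Return the command names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An empty list from missing_tools() suggests both Homebrew and flac are installed and visible to the interpreter.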
Figured it out - I just forgot to install Homebrew
I'm getting this error while using the epub package for Python and am wondering what to do about it. Please help.
Traceback (most recent call last):
File "F:/4th semester/3", line 4, in <module>
book=epub.open_epub('d:\welcome.epub')
File "C:\Python27\lib\site-packages\epub\__init__.py", line 43, in open_epub
return EpubFile(filename, mode)
File "C:\Python27\lib\site-packages\epub\__init__.py", line 82, in __init__
self._init_read()
File "C:\Python27\lib\site-packages\epub\__init__.py", line 143, in _init_read
self.toc = ncx.parse_toc(self.read_item(item_toc))
File "C:\Python27\lib\site-packages\epub\__init__.py", line 276, in read_item
return self.read(os.path.join(self.content_path, path))
File "C:\Python27\lib\zipfile.py", line 931, in read
return self.open(name, "r", pwd).read()
File "C:\Python27\lib\zipfile.py", line 957, in open
zinfo = self.getinfo(name)
File "C:\Python27\lib\zipfile.py", line 905, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named u'OEBPS\\toc.ncx' in the archive"
From your question I presume you are using the Python epub library from here: https://pypi.python.org/pypi/epub/0.5.1 and that you are running on Windows.
It helps to know that EPUBs are essentially zip files. A typical bug in Python EPUB-handling libraries is building paths inside the zip archive with os.path.join, as if it were a regular file system. On Windows, os.path.join inserts the Windows path separator (a backslash, which is why your error shows u'OEBPS\\toc.ncx'), and the zipfile module does not recognize it because member names inside a zip archive use forward slashes.
This is a bug in the epub library (which should be reported), but you can easily work around it as follows:
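The mismatch can be reproduced on any operating system with a small in-memory archive. The sketch below is written for Python 3 and uses ntpath (the Windows implementation behind os.path) to imitate what os.path.join does on Windows; build_demo_archive and can_read are hypothetical helper names, and the archive contents are made up for the demonstration.

```python
import io
import ntpath
import zipfile


def build_demo_archive():
    """Create an in-memory zip holding one member, stored with a forward slash."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("OEBPS/toc.ncx", "<ncx/>")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


def can_read(archive, member_name):
    """Return True if the archive has a member with exactly this name."""
    try:
        archive.read(member_name)
        return True
    except KeyError:
        return False


# ntpath is the Windows flavour of os.path, so this runs the same on any OS.
windows_style = ntpath.join("OEBPS", "toc.ncx")  # backslash-separated
zip_style = "OEBPS/toc.ncx"                      # forward-slash-separated

archive = build_demo_archive()
print(can_read(archive, windows_style))
print(can_read(archive, zip_style))
```

The lookup with the backslash-separated name fails with the same kind of KeyError as in the traceback, while the forward-slash name is found.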
Determine where your epub sources are located:
python -c "import epub; print epub.__file__"
Add the following function to epub sources:
def zip_path_join(a, *p):
    # Join zip member names with '/', which is what zipfile expects on every OS.
    for b in p:
        a += '/' + b
    return a
Search epub sources for os.path.join and replace it with zip_path_join
Enjoy!
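As a variation on the function above (not the version from this answer), the standard posixpath module can do the joining. It always uses forward slashes, and it behaves differently from plain concatenation when the first component is empty, which may matter if the library's content path is an empty string; that scenario is an assumption, not something shown in the question.

```python
import posixpath


def zip_path_join(a, *p):
    """Join zip member name components with '/' regardless of the host OS."""
    return posixpath.join(a, *p)


print(zip_path_join("OEBPS", "toc.ncx"))
print(zip_path_join("", "toc.ncx"))
```

With an empty first component this returns the bare member name instead of a name with a leading slash, which is the form zip archives store for files at the archive root.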
Thanks for the report of the issue. This is... well... shame on me, I should have fixed that a long time ago.
So, I've pushed a new version, 0.5.2. You can upgrade and see whether it works as you expect (it should, though I didn't run the unit tests in any Windows environment).
PS: I won't say "I got a life and stuff happen", but... yeah, that's just it...
This is the output of buildozer:
buildozer android debug
# Check configuration tokens
# Ensure build layout
# Check configuration tokens
# Preparing build
# Check requirements for android
# Install platform
# Apache ANT found at /root/.buildozer/android/platform/apache-ant-1.8.4
# Android SDK found at /root/.buildozer/android/platform/android-sdk-21
# Android NDK found at /root/.buildozer/android/platform/android-ndk-r9c
# Android packages already installed.
# Check application requirements
# Compile platform
# Distribution compiled.
# Build the application #1
# Package the application
Traceback (most recent call last):
File "/bin/buildozer", line 5, in <module>
run()
File "/usr/lib/python2.7/site-packages/buildozer/__init__.py", line 1215, in run
Buildozer().run_command(sys.argv[1:])
File "/usr/lib/python2.7/site-packages/buildozer/__init__.py", line 842, in run_command
self.target.run_commands(args)
File "/usr/lib/python2.7/site-packages/buildozer/target.py", line 85, in run_commands
func(args)
File "/usr/lib/python2.7/site-packages/buildozer/target.py", line 97, in cmd_debug
self.buildozer.build()
File "/usr/lib/python2.7/site-packages/buildozer/__init__.py", line 178, in build
self.target.build_package()
File "/usr/lib/python2.7/site-packages/buildozer/targets/android.py", line 397, in build_package
version = self.buildozer.get_version()
File "/usr/lib/python2.7/site-packages/buildozer/__init__.py", line 554, in get_version
' (looking for `{1}`)'.format(fn, regex))
Exception: Unable to find capture version in ./main.py
(looking for `__version__ = '(.*)'`)
I'm trying to compile a simple probability calculator I designed. I can't post the code, because I'm going to try to publish it. However, I'm willing to answer any questions needed to get this to work.
Judging by the output of buildozer, I think it's looking for a line in main.py that I didn't know I needed. Unfortunately, I don't have any idea what that line would look like. However, in buildozer.spec, there is a line that says this:
version.regex = __version__ = '(.*)'
version.filename = %(source.dir)s/main.py
The first line looks like the line in the output and the second references the main.py file. Does anyone know what these lines mean? I am new to buildozer, so I'm not quite sure what to do here. Thanks in advance for your help.
By default, buildozer looks for a line in your main.py of the form __version__ = 'something'. This is used to set the apk version, a required field.
You can either add this line to your main.py, or comment out the version check and uncomment the alternative version method on the next lines of buildozer.spec. This lets you set the version string in buildozer.spec itself.
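The check amounts to searching the file for the pattern from buildozer.spec and taking the captured group as the version. The sketch below illustrates that idea with Python's re module; it is not buildozer's actual implementation, and find_version is a hypothetical helper.

```python
import re

# Pattern copied from the version.regex line in buildozer.spec.
VERSION_REGEX = r"__version__ = '(.*)'"


def find_version(source_text, regex=VERSION_REGEX):
    """Return the captured version string, or None if the pattern does not match."""
    match = re.search(regex, source_text)
    return match.group(1) if match else None


print(find_version("__version__ = '0.1'\n"))
print(find_version('__version__ = "0.1"\n'))
```

Note that the pattern as written expects single quotes and the exact spacing around the equals sign, so a line written with double quotes does not match it.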
Add __version__ = '0.1' at the top of your main.py file so that you can package your application without this error.