I'm trying to embed Jetty 9.4.21.v20190926, but it prints far too much unnecessary log output, like this:
Server#239963d8{STARTED}[9.4.21.v20190926] - STARTED
+= QueuedThreadPool[qtp1268447657]#4b9af9a9{STARTED,8<=8<=200,i=5,r=4,q=0}[ReservedThreadExecutor#46daef40{s=0/4,p=0}] - STARTED
| += ReservedThreadExecutor#46daef40{s=0/4,p=0} - STARTED
| +> threads size=8
| +> 17 qtp1268447657-17 IDLE TIMED_WAITING # sun.misc.Unsafe.park(Native Method)
| +> 12 qtp1268447657-12-acceptor-0#f5bee51-ServerConnector#7de26db8{HTTP/1.1,[http/1.1]}{0.0.0.0:23689} ACCEPTING RUNNABLE # sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) prio=3
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
+= ServerConnector#7de26db8{HTTP/1.1,[http/1.1]}{0.0.0.0:23689} - STARTED
| +~ Server#239963d8{STARTED}[9.4.21.v20190926] - STARTED
| +~ QueuedThreadPool[qtp1268447657]#4b9af9a9{STARTED,8<=8<=200,i=5,r=4,q=0}[ReservedThreadExecutor#46daef40{s=0/4,p=0}] - STARTED
| += ScheduledExecutorScheduler#12f41634{STARTED} - STARTED
| +- org.eclipse.jetty.io.ArrayByteBufferPool#13c27452
| += HttpConnectionFactory#7637f22[HTTP/1.1] - STARTED
| | +- HttpConfiguration#262b2c86{32768/8192,8192/8192,https://:0,[]}
| | +> customizers size=0
| | +> formEncodedMethods size=2
| | | +> POST
| | | +> PUT
| | +> outputBufferSize=32768
| | +> MANY THINGs......
| | +> MANY THINGs......
+= ErrorHandler#2de8284b{STARTED} - STARTED
+= DefaultSessionIdManager#17d0685f{STARTED}[worker=node0] - STARTED
| += HouseKeeper#67b92f0a{STARTED}[interval=660000, ownscheduler=true] - STARTED
+> sun.misc.Launcher$AppClassLoader#12a3a380
+> URLs size=18
| +> file:~lib/jetty-io-9.4.21.v20190926.jar
| +> MANY THINGs......
| +> file:~lib/javax.servlet-api-3.1.0.jar
| +> MANY THINGs......
| +> MANY THINGs......
| +> file:~build/classes/
+> sun.misc.Launcher$ExtClassLoader#396e2f39
+> URLs size=12
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/access-bridge-64.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/cldrdata.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/jfxrt.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/localedata.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/nashorn.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/sunec.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/sunjce_provider.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/sunmscapi.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/sunpkcs11.jar
+> file:/C:/Program%20Files/Java/jdk1.8.0_211/jre/lib/ext/zipfs.jar
This is painful to read. Is there any way to reduce the log output to something like the below?
2020-01-12 09:30:36.323:INFO::main: Logging initialized #285ms to org.eclipse.jetty.util.log.StdErrLog
2020-01-12 09:30:37.139:INFO:oejs.Server:main: jetty-9.4.21.v20190926; built: 2019-09-26T16:41:09.154Z; git: 72970db61a2904371e1218a95a3bef5d79788c33; jvm 1.8.0_211-b12
2020-01-12 09:30:37.307:INFO:oejs.session:main: DefaultSessionIdManager workerName=node0
2020-01-12 09:30:37.307:INFO:oejs.session:main: No SessionScavenger set, using defaults
2020-01-12 09:30:37.311:INFO:oejs.session:main: node0 Scavenging every 660000ms
2020-01-12 09:30:37.390:INFO:oejsh.ContextHandler:main: Started o.e.j.s.ServletContextHandler#42f93a98{/,null,AVAILABLE}
2020-01-12 09:30:37.762:INFO:oejs.AbstractConnector:main: Started ServerConnector#7de26db8{HTTP/1.1,[http/1.1]}{0.0.0.0:23689}
2020-01-12 09:30:37.762:INFO:oejs.Server:main: Started #1736ms
I added only jetty.jar at first, then added each of the other required JARs based on the errors logged at run time.
As you can see, there is so much useless log output that I even had to write more text just to be able to post my question.
This is a Jetty server dump. By default it is not displayed after startup, but this can be configured on the server. In embedded code, call server.setDumpAfterStart(false) to disable it (or pass true to enable it).
I would like to know how to monitor a specific program (by its PID) and get a report of its RAM usage, similar to perf record -p <PID> sleep 15 && perf report, showing which instructions use the most memory.
I already know about the top command, but it is not what I want.
Massif is a heap profiler included in the valgrind suite, and can provide some of this information.
Start it with valgrind --tool=massif <your program>. This will create a massif.out.<pid> file that contains various "snapshots" of heap memory usage taken while the program ran. A simple viewer, ms_print, is included and will dump all the snapshots with stack traces.
For example:
83.83% (10,476B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->30.03% (3,752B) 0x4E6079B: _nl_make_l10nflist (l10nflist.c:241)
| ->24.20% (3,024B) 0x4E608E7: _nl_make_l10nflist (l10nflist.c:285)
| | ->12.10% (1,512B) 0x4E5A091: _nl_find_locale (findlocale.c:218)
| | | ->12.10% (1,512B) 0x4E5978B: setlocale (setlocale.c:340)
| | | ->12.10% (1,512B) 0x4016BA: main (sleep.c:106)
| | |
| | ->12.10% (1,512B) 0x4E608E7: _nl_make_l10nflist (l10nflist.c:285)
| | ->09.41% (1,176B) 0x4E5A091: _nl_find_locale (findlocale.c:218)
| | | ->09.41% (1,176B) 0x4E5978B: setlocale (setlocale.c:340)
| | | ->09.41% (1,176B) 0x4016BA: main (sleep.c:106)
| | |
| | ->02.69% (336B) 0x4E608E7: _nl_make_l10nflist (l10nflist.c:285)
| | ->02.69% (336B) 0x4E5A091: _nl_find_locale (findlocale.c:218)
| | ->02.69% (336B) 0x4E5978B: setlocale (setlocale.c:340)
| | ->02.69% (336B) 0x4016BA: main (sleep.c:106)
| |
| ->05.83% (728B) 0x4E5A091: _nl_find_locale (findlocale.c:218)
| ->05.83% (728B) 0x4E5978B: setlocale (setlocale.c:340)
| ->05.83% (728B) 0x4016BA: main (sleep.c:106)
Check pmap:
pmap <PID>
With pmap, you can see all of the memory mappings used by a process. There are also many other techniques described here.
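For a scriptable variant of the same idea, here is a minimal Python sketch that sums the Rss figures the Linux kernel reports per mapping in /proc/<PID>/smaps. The function names are hypothetical, and it assumes a Linux system where you are allowed to read that file for the target process.

```python
import re
from pathlib import Path

# Matches lines such as "Rss:                 132 kB" in /proc/<PID>/smaps.
_RSS_LINE = re.compile(r"^Rss:[ \t]+(\d+)[ \t]+kB$", re.MULTILINE)


def total_rss_kb(smaps_text: str) -> int:
    """Sum the resident set size (in kB) over every mapping in smaps text."""
    return sum(int(kb) for kb in _RSS_LINE.findall(smaps_text))


def process_rss_kb(pid: int) -> int:
    """Read /proc/<pid>/smaps (Linux only) and return the total RSS in kB."""
    return total_rss_kb(Path(f"/proc/{pid}/smaps").read_text())
```

Calling process_rss_kb(<PID>) in a loop with a short sleep would give a rough memory-over-time report, although unlike Massif it says nothing about which code made the allocations.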
According to my assignment, an admin must be able to create Polls with Questions (create, delete, update) and the Choices related to these questions. All of this should be displayed and changeable on the same admin page.
Poll
|
|_question_1
| |
| |_choice_1(text)
| |
| |_choice_2
| |
| |_choice_3
|
|_question_2
| |
| |_choice_1
| |
| |_choice_2
| |
| |_choice_3
|
|_question_3
|
|_choice_1
|
|_choice_2
|
|_choice_3
OK, displaying one level of nesting is not a problem, like so:
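from django.contrib import admin
from .models import Question  # assumes the models live in this app's models.py

# Shows the questions of a poll inline on the poll's admin page
class QuestionInline(admin.StackedInline):
    model = Question

class PollAdmin(admin.ModelAdmin):
    inlines = [
        QuestionInline,
    ]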
But how do I get the required nested poll structure?
Check out this library; it should provide the functionality.
Been at this for a few days and any help is greatly appreciated.
Background:
I am attempting to create 1+ glue crawlers to crawl the following S3 "directory" structure:
.
+-- _source1
| +-- _item1
| | +-- _2019 #year
| | | +-- _08 #month
| | | | +-- _30 #day
| | | | | +-- FILE1.csv #files
| | | | | +-- FILE2.csv
| | | | +-- _31
| | | | | +-- FILE1.csv
| | | | | +-- FILE2.csv
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
| +-- _item2
| | +-- _2019
| | | +-- _08
| | | | +-- _30
| | | | +-- _31
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
+-- _source2
| +-- ....
........ # and so on...
This goes on for several sources, each with potentially 30+ items, each of which has the year/month/day directory structure within.
All files are CSVs, and files should not change once they're in S3. However, the schemas for the files within each item folder may have columns added in the future.
2019/12/01/FILE.csv may have additional columns compared to 2019/09/01/FILE.csv.
What I've Done:
In my testing so far, crawlers created at the source-level directories (see above) have worked perfectly, as long as no CSV contains only string-type columns.
This is due to the following restriction, as stated in the AWS docs:
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
Normally, I'd imagine you could get around this by creating a custom classifier that expects a certain CSV schema, but seeing as I may have 200+ items (different schemas) to crawl, I'd like to avoid this.
Proposed Solutions:
Ideally, I'd like to force my crawlers to interpret the first row of every CSV as a header, but this doesn't seem possible...
Add a dummy INT column to every CSV to force my crawlers to read the CSV headers, and delete/ignore the column down the pipeline. (Seems very hackish)
Find another file format that works (will require changes throughout my ETL pipeline)
DON'T USE GLUE
Thanks again for any help!
Found the issue: it turns out that in order for an updated Glue crawler classifier to take effect, a new crawler must be created and have the updated classifier applied. As far as I can tell, this is not explicitly mentioned in the AWS docs, and I've only seen mention of it over on GitHub.
Early on in my testing I modified an existing csv classifier that specifies "Has Columns", but never created a new crawler to apply my modified classifier to. Once I created a new crawler and applied the classifier, all data catalog tables were created as expected regardless of column types.
TL;DR: Modified classifiers will not take effect unless they are applied to a new crawler. Source
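As a rough illustration of the fix, the following Python sketch builds the arguments for boto3's Glue create_classifier and create_crawler calls, so that a brand-new crawler is created with a CSV classifier that always treats the first row as a header. The helper names and every argument value are hypothetical placeholders; only the boto3 call names and parameter keys are taken from the Glue API, and the sketch assumes boto3 is installed and AWS credentials are configured when the last function is called.

```python
def csv_classifier_request(name: str) -> dict:
    """Arguments for create_classifier: a CSV classifier with a forced header."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": ",",
            "ContainsHeader": "PRESENT",  # first row is always the header
        }
    }


def crawler_request(name: str, role: str, database: str, s3_path: str,
                    classifier: str) -> dict:
    """Arguments for create_crawler with the custom classifier attached."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": [classifier],
    }


def create_crawler_with_header_classifier(name: str, role: str, database: str,
                                          s3_path: str, classifier: str) -> None:
    """Create the classifier, then a new crawler that uses it."""
    import boto3  # imported lazily so the builders above work without boto3

    glue = boto3.client("glue")
    glue.create_classifier(**csv_classifier_request(classifier))
    glue.create_crawler(
        **crawler_request(name, role, database, s3_path, classifier)
    )
```

Keeping the request builders separate from the AWS calls makes it easy to generate one crawler per source directory in a loop and to inspect the arguments before anything is created.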
Basically, my project is set up like this:
Environment
| -- API
| | -- Project1
| | | -- API_one
| | | -- API_one_Project1.py
| | -- Project2
| | | -- API_one
| | | -- API_one_Project2.py
| | API_one.py
| -- External_scripts
| | -- external_script.py
| framework.py
API_one.py
exec("from API.%s.API_one.API_one_%s import *" % (project, project))
If an external script wants to use API_one.py, it just imports it, and API_one.py decides which actual implementation to import based on the injected project variable.
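For comparison, the same dispatch could be written without exec by using importlib. This is a swapped-in technique, not what the project above currently does; the function names are hypothetical, and the module path simply mirrors the layout shown in the tree.

```python
import importlib


def implementation_module_name(project: str) -> str:
    """Return the dotted module path used by the layout in the question."""
    return f"API.{project}.API_one.API_one_{project}"


def load_implementation(project: str, namespace: dict) -> None:
    """Import the project-specific module and copy its public names into
    namespace, mirroring what `from ... import *` does."""
    module = importlib.import_module(implementation_module_name(project))
    names = getattr(module, "__all__", None)
    if names is None:
        # Without __all__, star-imports take every name not starting with "_".
        names = [n for n in vars(module) if not n.startswith("_")]
    for name in names:
        namespace[name] = getattr(module, name)
```

Inside API_one.py this would be called as load_implementation(project, globals()). A failed import then raises a normal ImportError that names the missing module.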
I have a function defined in API_one_Project1.py which takes several input arguments, does some work and then returns a value:
API_one_Project1.py
def foo(argument=None):
    if argument:
        argument += 1
    return argument
But when I call foo from an external script and pass an argument, I never enter the if statement:
external_script.py
import API_one as one
one.foo(argument=3)
Could anyone explain to me what is going on?
I figured out how to read files into my pyspark shell (and script) from an S3 directory, e.g. by using:
rdd = sc.wholeTextFiles('s3n://bucketname/dir/*')
But, while that's great in letting me read all the files in ONE directory, I want to read every single file from all of the directories.
I don't want to flatten them or load everything at once, because I will have memory issues.
Instead, I need it to automatically go load all the files from each sub-directory in a batched manner. Is that possible?
Here's my directory structure:
S3_bucket_name -> year (2016 or 2017) -> month (max 12 folders) -> day (max 31 folders) -> sub-day folders (max 30; basically just how I partitioned each day's collection).
Something like this, except it'll go for all 12 months and up to 31 days...
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(12)
|
|---Year(2017)
| |
| |---Month(1)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(2)
Each arrow above represents a fork. e.g. I've been collecting data for 2 years, so there are 2 years in the "year" fork. Then for each year, up to 12 months max, and then for each month, up to 31 possible day folders. And in each day, there will be up to 30 folders just because I split it up that way...
I hope that makes sense...
I was looking at another post (read files recursively from sub directories with spark from s3 or local filesystem) where I believe they suggested using wildcards, so something like:
rdd = sc.wholeTextFiles('s3n://bucketname/*/data/*/*')
But the problem with that is it tries to find a common folder among the various subdirectories - in this case there are no guarantees and I would just need everything.
However, on that line of reasoning, I thought what if I did..:
rdd = sc.wholeTextFiles('s3n://bucketname/*/*/*/*/*')
But the issue is that now I get OutOfMemory errors, probably because it's loading everything at once and freaking out.
Ideally, what I would be able to do is this:
Go to the sub-directory level of the day and read those in, so e.g.
First read in 2016/12/01, then 2016/12/02, up until 2016/12/31, and then 2017/01/01, then 2017/01/02, ... 2017/01/31 and so on.
That way, instead of using five wildcards (*) as I did above, I would somehow have it know to look through each sub-directory at the level of "day".
I thought of using a python dictionary to specify the file path to each of the days, but that seems like a rather cumbersome approach. What I mean by that is as follows:
file_dict = {
    0: '2016/12/01/*/*',
    1: '2016/12/02/*/*',
    # ...
    30: '2016/12/31/*/*',
}
basically for all the folders, and then iterating through them and loading them in using something like this:
sc.wholeTextFiles('s3n://bucketname/' + file_dict[i])
But I don't want to manually type out all those paths. I hope this made sense...
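The per-day paths described above do not have to be typed by hand; they can be generated from a date range. A minimal sketch using only the standard library follows. The function names are hypothetical, and it assumes that year, month and day folders are zero-padded as in the 2016/12/01 examples.

```python
from datetime import date, timedelta


def day_prefixes(start: date, end: date):
    """Yield one 'YYYY/MM/DD/*/*' glob per day from start to end inclusive."""
    current = start
    while current <= end:
        yield current.strftime("%Y/%m/%d") + "/*/*"
        current += timedelta(days=1)


def day_paths(bucket: str, start: date, end: date) -> list:
    """Prefix each daily glob with the s3n:// bucket URI used above."""
    return [f"s3n://{bucket}/{prefix}" for prefix in day_prefixes(start, end)]
```

Each returned path can then be passed to sc.wholeTextFiles one at a time, so that only a single day's files are loaded per iteration.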
EDIT:
Another way of asking the question is, how do I read the files from a nested sub-directory structure in a batched way? How can I enumerate all the possible folder names in my s3 bucket in python? Maybe that would help...
EDIT2:
The structure of the data in each of my files is as follows:
{json object 1},
{json object 2},
{json object 3},
...
{json object n},
For it to be "true JSON", it would need to be either like the above without the trailing comma at the end, or something like this (note the square brackets and the lack of a final trailing comma):
[
{json object 1},
{json object 2},
{json object 3},
...
{json object n}
]
The reason I did it entirely in PySpark as a script I submit is because I forced myself to handle this formatting quirk manually. If I use Hive/Athena, I am not sure how to deal with it.
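One way to handle this formatting quirk is sketched below, under the assumption that each file contains only comma-separated JSON objects with an optional trailing comma. The function name is hypothetical.

```python
import json


def parse_comma_separated_objects(text: str) -> list:
    """Parse '{...},\\n{...},' style content into a list of Python objects."""
    body = text.strip()
    if body.endswith(","):
        body = body[:-1]  # drop the final trailing comma
    # Wrapping the objects in square brackets makes the text a JSON array.
    return json.loads("[" + body + "]")
```

Since sc.wholeTextFiles yields (path, content) pairs, this could be applied with something like rdd.flatMap(lambda kv: parse_comma_separated_objects(kv[1])).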
Why don't you use Hive or, even better, Athena? Both let you define tables on top of file systems, giving you access to all the data. You can then pull this into Spark.
Alternatively, I believe you can also use HiveQL in Spark to set up a temp table on top of your file system location; it will register everything as a Hive table that you can execute SQL against. It's been a while since I've done that, but it is definitely doable.