nutch 1.16 parsechecker issue with file:/directory/ inputs

nutch 1.16 parsechecker issue with file:/directory/ inputs - regex

Building up from nutch 1.16 skips file:/directory styled links in file system crawl , I have been trying (and failing) to get nutch to crawl through different directories and subdirectories on a Windows 10 installation, calling commands with Cygwin.
The file dirs/seed.txt, used to initiate the crawl, contains the following:
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
Running cat ./dirs/seed.txt | ./bin/nutch normalizerchecker -stdin to check on how Nutch is normalizing (default regex-normalize.xml) yields
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
While running cat ./dirs/seed.txt | ./bin/nutch filterchecker -stdin returns:
+file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
Meaning all directories are seen as valid. So far, so good, but then, running the following:
cat ./dirs/seed.txt | ./bin/nutch parsechecker -stdin
yields the same error for all three directories, namely:
Fetch failed with protocol status: notfound(14), lastModified=0
The files in logs also do not really tell me anything of what went wrong, just that it won't read the input no matter what, as the logs only contain a "fetching directory X" message per entry...
So what exactly is going on here? I'll also leave the nutch-site.xml , regex-urlfilter.txt and regex-normalize.xml files, for completeness' sake.
nutch-site.xml :
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpiderTest</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpiderTest,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>I am just testing nutch, please tell me if it's bothering your website</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
<description>The crawler is not restricted to the directories that you specified in the
Urls file but it is jumping into the parent directories as well. For your own crawlings you can
change this behavior (set to false) the way that only directories beneath the directories that you specify get
crawled.</description>
</property>
<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated documents. By default this
property is activated due to extremely high levels of CPU which parsing can sometimes take.
</description>
</property>
<!-- the following is just an attempt at using a solution I found elsewhere, didn't work -->
<property>
<name>http.robot.rules.whitelist</name>
<value>file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
</configuration>
regex-urlfilter.txt:
# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http: ftp: mailto: and https: urls
-^(http|ftp|mailto|https):
# This change is not necessary but may make your life easier.
# Any file types you do not want to index need to be added to the list otherwise
# Nutch will often try to parse them and fail in doing so as it doesnt know
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$
# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
+.*(/[^/]+)/[^/]+\1/[^/]+\1/
# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# accept anything else
+.
regex-normalize.txt:
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
done on URLs using the Java regex syntax, see
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an xml parser reads this file an ampersands must be
expanded to & -->
<!-- The following rules show how to strip out session IDs, default pages,
interpage anchors, etc. Order does matter! -->
<regex-normalize>
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
<pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
<substitution>$4</substitution>
</regex>
<!-- changes default pages into standard for /index.html, etc. into /
<regex>
<pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&|#|$)</pattern>
<substitution>/$3</substitution>
</regex> -->
<!-- removes interpage href anchors such as site.com#location -->
<regex>
<pattern>#.*?(\?|&|$)</pattern>
<substitution>$1</substitution>
</regex>
<!-- cleans ?&var=value into ?var=value -->
<regex>
<pattern>\?&</pattern>
<substitution>\?</substitution>
</regex>
<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
<pattern>&{2,}</pattern>
<substitution>&</substitution>
</regex>
<!-- removes trailing ? -->
<regex>
<pattern>[\?&\.]$</pattern>
<substitution></substitution>
</regex>
<!-- normalize file:/// protocol prefix: -->
<!-- keep one single slash (NUTCH-1483) -->
<regex>
<pattern>^file://+</pattern>
<substitution>file:/</substitution>
</regex>
<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<regex>
<pattern>(?<!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
</regex-normalize>
Any idea what I'm doing wrong here?

Nutch's file: protocol implementation "fetches" local files by creating a File object using the path component of the URL: /cygdrive/c/Users/abc/Desktop/anotherdirectory/. As stated in the discussion "Is there a java sdk for cygwin?", Java does not translate the path, but replacing cygdrive/c/ by c:/ should work.

Related

How to disable JSON format and send only the log message to Sumologic with Fluentbit?

We are using Fluentbit as as Sidecar container in our ECS fargate Cluster which is running a dotnet application, initially we faced the issue of fluentbit sending the logs in multiline and we solved it using Fluentbit Multilne feature. Now the logs are being sent to Sumologic in Multiple however it is being sent as Json format whereas we just want fluentbit send only the raw log
Logs are currently
{
date:1675120653.269619,
container_id:"xvgbertytyuuyuyu",
container_name:"XXXXXXXXXX",
source:"stdout",
log:"2023-01-30 23:17:33.269Z DEBUG [.NET ThreadPool Worker] Connection.ManagedDbConnection - ComponentInstanceEntityAsync - Executing stored proc: dbo.prcGetComponentInstance"
}
We want only the line
2023-01-30 23:17:33.269Z DEBUG [.NET ThreadPool Worker] Connection.ManagedDbConnection - ComponentInstanceEntityAsync - Executing stored proc: dbo.prcGetComponentInstance

You need to modify Fluent Bit configuration to have the following filters and output configuration:
fluent.conf:
## prepare headers for Sumo Logic
[FILTER]
Name record_modifier
Match *
Record headers.content-type text/plain
## Set headers as headers attribute
[FILTER]
Name nest
Match *
Operation nest
Wildcard headers.*
Nest_under headers
Remove_prefix headers.
[OUTPUT]
Name http
...
# use log key as body
body_key $log
# use headers key as headers
headers_key $headers
That way, you are going to craft HTTP request manually. This is going to send request per log, which is not necessary a good idea. In order to mitigate that you can add the following parser and use it (flush_timeout may need an adjustment):
parsers.conf
# merge everything as one big log
[MULTILINE_PARSER]
name multiline-all
type regex
flush_timeout 500
#
# Regex rules for multiline parsing
# ---------------------------------
#
# configuration hints:
#
# - first state always has the name: start_state
# - every field in the rule must be inside double quotes
#
# rules | state name | regex pattern | next state
# ------|---------------|--------------------------------------------
rule "start_state" ".*" "cont"
rule "cont" ".*" "cont"
fluent.conf:
[INPUT]
name tail
...
multiline.parser multiline-all

Regex for "wp-admin" "wp-login" entries in syslog trying on drupal sites

I am looking for a fail2ban regex (or two) to find the wp-admin and wp-login attemps on drupal sites.
The regex should find "drupal:" and "page not found" and ("wp-admin" or "wp-login")
the problem for me are the "and" conditions
The logfile entries:
Apr 7 10:59:23 webserver drupal: https://www.anywebsite.com|1617785962|page not found|123.456.789.112|https://www.anywebsite.com/wp-admin/admin-ajax.php?action=revslider_show_image&img=../wp-config.php|https://anywebsite.com/wp-admin/admin-ajax.php?action=revslider_show_image&img=../wp-config.php|0||wp-admin/admin-ajax.php
Apr 7 06:53:47 webserver drupal: https://www.anywebsite.com|1617771227|page not found|123.456.789.112|https://www.anywebsite.com/wp/wp-login.php||0||wp/wp-login.php

Here you go:
failregex = ^\s*\S+ drupal: [^|]*\|\d+\|(?:page not found)\|<ADDR>
replace <ADDR> with <HOST> for fail2ban versions before v.0.10
WARNING Note that this assumes that first URI in your log-line (site? referrer?) after drupal: never contains a pipe-character (so an intruder is unable to add it to URI somehow to avoid ban). Otherwise it becomes complex (you must anchor it from both sides or write some conditional REs with lookaheads or lookbehinds).
Also note that if your side can make some 404 for legitimate users (because missing some references etc), you have to add to the RE some precise pattern excluding your missing pages to avoid false positives, e. g. something like this (with blacklisting expressions):
_block_uris = wp-admin|(?:wp/)wp-login
failregex = ^\s*\S+ drupal: [^|]*\|\d+\|(?:page not found)\|<ADDR>\|\w+://[^/]+/(?:%(_block_uris)s)
or (with white-listing expressions, here ignoring /my-page/ and my-site/ URIs):
_ignore_uris = my-page/|my-side/
failregex = ^\s*\S+ drupal: [^|]*\|\d+\|(?:page not found)\|<ADDR>\|\w+://[^/]+/(?!%(_ignore_uris)s)

How to create multiple mount points with different playlists in icecast with mpd?

I have installed icecast and MPD with YMPD client on the server.
Currently It is running for single mount. I want to stream audio on different mounting like: /stream.ogg, /mp3, /audio with different playlists.
Below is my config files:
1. mpd.conf:
# An example configuration file for MPD.
# Read the user manual for documentation: http://www.musicpd.org/doc/user/
# or /usr/share/doc/mpd/user-manual.html
# Files and directories #######################################################
#
# This setting controls the top directory which MPD will search to discover the
# available audio files and add them to the daemon's online database. This
# setting defaults to the XDG directory, otherwise the music directory will be
# be disabled and audio files will only be accepted over ipc socket (using
# file:// protocol) or streaming files over an accepted protocol.
#
music_directory "/var/lib/mpd/music"
#
# This setting sets the MPD internal playlist directory. The purpose of this
# directory is storage for playlists created by MPD. The server will use
# playlist files not created by the server but only if they are in the MPD
# format. This setting defaults to playlist saving being disabled.
#
playlist_directory "/var/lib/mpd/playlists"
#
# This setting sets the location of the MPD database. This file is used to
# load the database at server start up and store the database while the
# server is not up. This setting defaults to disabled which will allow
# MPD to accept files over ipc socket (using file:// protocol) or streaming
# files over an accepted protocol.
#
db_file "/var/lib/mpd/tag_cache"
#
# These settings are the locations for the daemon log files for the daemon.
# These logs are great for troubleshooting, depending on your log_level
# settings.
#
# The special value "syslog" makes MPD use the local syslog daemon. This
# setting defaults to logging to syslog, otherwise logging is disabled.
#
log_file "/var/log/mpd/mpd.log"
#
# This setting sets the location of the file which stores the process ID
# for use of mpd --kill and some init scripts. This setting is disabled by
# default and the pid file will not be stored.
#
pid_file "/run/mpd/pid"
#
# This setting sets the location of the file which contains information about
# most variables to get MPD back into the same general shape it was in before
# it was brought down. This setting is disabled by default and the server
# state will be reset on server start up.
#
state_file "/var/lib/mpd/state"
#
# The location of the sticker database. This is a database which
# manages dynamic information attached to songs.
#
sticker_file "/var/lib/mpd/sticker.sql"
#
###############################################################################
# General music daemon options ################################################
#
# This setting specifies the user that MPD will run as. MPD should never run as
# root and you may use this setting to make MPD change its user ID after
# initialization. This setting is disabled by default and MPD is run as the
# current user.
#
#user "mpd"
#
# This setting specifies the group that MPD will run as. If not specified
# primary group of user specified with "user" setting will be used (if set).
# This is useful if MPD needs to be a member of group such as "audio" to
# have permission to use sound card.
#
#group "nogroup"
#
# This setting sets the address for the daemon to listen on. Careful attention
# should be paid if this is assigned to anything other then the default, any.
# This setting can deny access to control of the daemon. Choose any if you want
# to have mpd listen on every address. Not effective if systemd socket
# activation is in use.
#
# For network
bind_to_address "127.0.0.1"
#
# And for Unix Socket
#bind_to_address "/run/mpd/socket"
#
# This setting is the TCP port that is desired for the daemon to get assigned
# to.
#
port "6600"
#
# This setting controls the type of information which is logged. Available
# setting arguments are "default", "secure" or "verbose". The "verbose" setting
# argument is recommended for troubleshooting, though can quickly stretch
# available resources on limited hardware storage.
#
#log_level "default"
#
# If you have a problem with your MP3s ending abruptly it is recommended that
# you set this argument to "no" to attempt to fix the problem. If this solves
# the problem, it is highly recommended to fix the MP3 files with vbrfix
# (available as vbrfix in the debian archive), at which
# point gapless MP3 playback can be enabled.
#
#gapless_mp3_playback "yes"
#
# Setting "restore_paused" to "yes" puts MPD into pause mode instead
# of starting playback after startup.
#
#restore_paused "no"
#
# This setting enables MPD to create playlists in a format usable by other
# music players.
#
#save_absolute_paths_in_playlists "no"
#
# This setting defines a list of tag types that will be extracted during the
# audio file discovery process. The complete list of possible values can be
# found in the mpd.conf man page.
#metadata_to_use "artist,album,title,track,name,genre,date,composer,performer,disc"
#
# This setting enables automatic update of MPD's database when files in
# music_directory are changed.
#
#auto_update "yes"
#
# Limit the depth of the directories being watched, 0 means only watch
# the music directory itself. There is no limit by default.
#
#auto_update_depth "3"
#
###############################################################################
# Symbolic link behavior ######################################################
#
# If this setting is set to "yes", MPD will discover audio files by following
# symbolic links outside of the configured music_directory.
#
#follow_outside_symlinks "yes"
#
# If this setting is set to "yes", MPD will discover audio files by following
# symbolic links inside of the configured music_directory.
#
#follow_inside_symlinks "yes"
#
###############################################################################
# Zeroconf / Avahi Service Discovery ##########################################
#
# If this setting is set to "yes", service information will be published with
# Zeroconf / Avahi.
#
#zeroconf_enabled "yes"
#
# The argument to this setting will be the Zeroconf / Avahi unique name for
# this MPD server on the network.
#
#zeroconf_name "Music Player"
#
###############################################################################
# Permissions #################################################################
#
# If this setting is set, MPD will require password authorization. The password
# can setting can be specified multiple times for different password profiles.
#
#password "password#read,add,control,admin"
#
# This setting specifies the permissions a user has who has not yet logged in.
#
default_permissions "read,add,control,admin"
#
###############################################################################
# Database #######################################################################
#
#database {
# plugin "proxy"
# host "other.mpd.host"
# port "6600"
#}
# Input #######################################################################
#
input {
plugin "curl"
# proxy "proxy.isp.com:8080"
# proxy_user "user"
# proxy_password "password"
}
#
###############################################################################
# Audio Output ################################################################
#
# MPD supports various audio output types, as well as playing through multiple
# audio outputs at the same time, through multiple audio_output settings
# blocks. Setting this block is optional, though the server will only attempt
# autodetection for one sound card.
#
# An example of an ALSA output:
#
audio_output {
type "alsa"
name "My ALSA Device"
# device "hw:0,0" # optional
# mixer_type "hardware" # optional
# mixer_device "default" # optional
# mixer_control "PCM" # optional
# mixer_index "0" # optional
}
#
# An example of an OSS output:
#
#audio_output {
# type "oss"
# name "My OSS Device"
# device "/dev/dsp" # optional
# mixer_type "hardware" # optional
# mixer_device "/dev/mixer" # optional
# mixer_control "PCM" # optional
#}
#
# An example of a shout output (for streaming to Icecast):
#
audio_output {
type "shout"
encoding "mp3" # optional
name "My Shout Stream"
host "localhost"
port "8000"
bind_to_address "127.0.0.1"
mount "/mp3"
password "4getme!"
quality "5.0"
# bitrate "128"
format "44100:16:1"
# protocol "icecast2" # optional
# user "USER" # optional
description "My Stream Description" # optional
# url "http://example.com" # optional
# genre "jazz" # optional
# public "no" # optional
# timeout "2" # optional
# mixer_type "software" # optional
}
#
# An example of a recorder output:
#
#audio_output {
# type "recorder"
# name "My recorder"
# encoder "vorbis" # optional, vorbis or lame
# path "/var/lib/mpd/recorder/mpd.ogg"
## quality "5.0" # do not define if bitrate is defined
# bitrate "128" # do not define if quality is defined
# format "44100:16:1"
#}
#
# An example of a httpd output (built-in HTTP streaming server):
#
#audio_output {
# type "httpd"
# name "My HTTP Stream"
# encoder "vorbis" # optional, vorbis or lame
# port "8000"
# bind_to_address "0.0.0.0" # optional, IPv4 or IPv6
# quality "5.0" # do not define if bitrate is defined
# bitrate "128" # do not define if quality is defined
# format "44100:16:1"
# max_clients "0" # optional 0=no limit
#}
#
# An example of a pulseaudio output (streaming to a remote pulseaudio server)
# Please see README.Debian if you want mpd to play through the pulseaudio
# daemon started as part of your graphical desktop session!
#
audio_output {
type "pulse"
name "My Pulse Output"
# server "remote_server" # optional
# sink "remote_server_sink" # optional
}
#
# An example of a winmm output (Windows multimedia API).
#
#audio_output {
# type "winmm"
# name "My WinMM output"
# device "Digital Audio (S/PDIF) (High Definition Audio Device)" # optional
# or
# device "0" # optional
# mixer_type "hardware" # optional
#}
#
# An example of an openal output.
#
#audio_output {
# type "openal"
# name "My OpenAL output"
# device "Digital Audio (S/PDIF) (High Definition Audio Device)" # optional
#}
#
## Example "pipe" output:
#
#audio_output {
# type "pipe"
# name "my pipe"
# command "aplay -f cd 2>/dev/null"
## Or if you're want to use AudioCompress
# command "AudioCompress -m | aplay -f cd 2>/dev/null"
## Or to send raw PCM stream through PCM:
# command "nc example.org 8765"
# format "44100:16:2"
#}
#
## An example of a null output (for no audio output):
#
#audio_output {
# type "null"
# name "My Null Output"
# mixer_type "none" # optional
#}
#
# If MPD has been compiled with libsamplerate support, this setting specifies
# the sample rate converter to use. Possible values can be found in the
# mpd.conf man page or the libsamplerate documentation. By default, this is
# setting is disabled.
#
#samplerate_converter "Fastest Sinc Interpolator"
#
###############################################################################
# Normalization automatic volume adjustments ##################################
#
# This setting specifies the type of ReplayGain to use. This setting can have
# the argument "off", "album", "track" or "auto". "auto" is a special mode that
# chooses between "track" and "album" depending on the current state of
# random playback. If random playback is enabled then "track" mode is used.
# See <http://www.replaygain.org> for more details about ReplayGain.
# This setting is off by default.
#
#replaygain "album"
#
# This setting sets the pre-amp used for files that have ReplayGain tags. By
# default this setting is disabled.
#
#replaygain_preamp "0"
#
# This setting sets the pre-amp used for files that do NOT have ReplayGain tags.
# By default this setting is disabled.
#
#replaygain_missing_preamp "0"
#
# This setting enables or disables ReplayGain limiting.
# MPD calculates actual amplification based on the ReplayGain tags
# and replaygain_preamp / replaygain_missing_preamp setting.
# If replaygain_limit is enabled MPD will never amplify audio signal
# above its original level. If replaygain_limit is disabled such amplification
# might occur. By default this setting is enabled.
#
#replaygain_limit "yes"
#
# This setting enables on-the-fly normalization volume adjustment. This will
# result in the volume of all playing audio to be adjusted so the output has
# equal "loudness". This setting is disabled by default.
#
#volume_normalization "no"
#
###############################################################################
# Character Encoding ##########################################################
#
# If file or directory names do not display correctly for your locale then you
# may need to modify this setting.
#
filesystem_charset "UTF-8"
#
# This setting controls the encoding that ID3v1 tags should be converted from.
#
id3v1_encoding "UTF-8"
#
###############################################################################
# SIDPlay decoder #############################################################
#
# songlength_database:
# Location of your songlengths file, as distributed with the HVSC.
# The sidplay plugin checks this for matching MD5 fingerprints.
# See http://www.c64.org/HVSC/DOCUMENTS/Songlengths.faq
#
# default_songlength:
# This is the default playing time in seconds for songs not in the
# songlength database, or in case you're not using a database.
# A value of 0 means play indefinitely.
#
# filter:
# Turns the SID filter emulation on or off.
#
#decoder {
# plugin "sidplay"
# songlength_database "/media/C64Music/DOCUMENTS/Songlengths.txt"
# default_songlength "120"
# filter "true"
#}
#
###############################################################################
2. icecast.xml
<icecast>
<!-- location and admin are two arbitrary strings that are e.g. visible
on the server info page of the icecast web interface
(server_version.xsl). -->
<location>Earth</location>
<admin>icemaster#localhost</admin>
<!-- IMPORTANT!
Especially for inexperienced users:
Start out by ONLY changing all passwords and restarting Icecast.
For detailed setup instructions please refer to the documentation.
It's also available here: http://icecast.org/docs/
-->
<limits>
<clients>100</clients>
<sources>2</sources>
<queue-size>524288</queue-size>
<client-timeout>30</client-timeout>
<header-timeout>15</header-timeout>
<source-timeout>10</source-timeout>
<!-- If enabled, this will provide a burst of data when a client
first connects, thereby significantly reducing the startup
time for listeners that do substantial buffering. However,
it also significantly increases latency between the source
client and listening client. For low-latency setups, you
might want to disable this. -->
<burst-on-connect>1</burst-on-connect>
<!-- same as burst-on-connect, but this allows for being more
specific on how much to burst. Most people won't need to
change from the default 64k. Applies to all mountpoints -->
<burst-size>65535</burst-size>
</limits>
<authentication>
<!-- Sources log in with username 'source' -->
<source-password>4getme!</source-password>
<!-- Relays log in with username 'relay' -->
<relay-password>4getme!</relay-password>
<!-- Admin logs in with the username given below -->
<admin-user>admin</admin-user>
<admin-password>4getme!</admin-password>
</authentication>
<!-- set the mountpoint for a shoutcast source to use, the default if not
specified is /stream but you can change it here if an alternative is
wanted or an extension is required
<shoutcast-mount>/live.nsv</shoutcast-mount>
-->
<!-- Uncomment this if you want directory listings -->
<directory>
<yp-url-timeout>15</yp-url-timeout>
<yp-url>/var/lib/mpd/music</yp-url>
</directory>
-->
<!-- This is the hostname other people will use to connect to your server.
It affects mainly the urls generated by Icecast for playlists and yp
listings. You MUST configure it properly for YP listings to work!
-->
<hostname>localhost</hostname>
<!-- You may have multiple <listener> elements -->
<listen-socket>
<port>8000</port>
<!-- <bind-address>127.0.0.1</bind-address> -->
<!-- <shoutcast-mount>/stream</shoutcast-mount> -->
</listen-socket>
<listen-socket>
<port>8005</port>
<bind-address>127.0.0.1</bind-address>
<shoutcast-mount>/stream</shoutcast-mount>
</listen-socket>
<listen-socket>
<port>8006</port>
</listen-socket>
<listen-socket>
<port>8007</port>
<shoutcast-mount>/live.mp3</shoutcast-mount>
</listen-socket>
<!--
<listen-socket>
<port>8443</port>
<ssl>1</ssl>
</listen-socket>
-->
<!-- Global header settings
Headers defined here will be returned for every HTTP request to Icecast.
The ACAO header makes Icecast public content/API by default
This will make streams easier embeddable (some HTML5 functionality needs it).
Also it allows direct access to e.g. /status-json.xsl from other sites.
If you don't want this, comment out the following line or read up on CORS.
-->
<http-headers>
<header name="Access-Control-Allow-Origin" value="*" />
</http-headers>
<!-- Relaying
You don't need this if you only have one server.
Please refer to the config for a detailed explanation.
-->
<!--<master-server>127.0.0.1</master-server>-->
<!--<master-server-port>8001</master-server-port>-->
<!--<master-update-interval>120</master-update-interval>-->
<!--<master-password>hackme</master-password>-->
<!-- setting this makes all relays on-demand unless overridden, this is
useful for master relays which do not have <relay> definitions here.
The default is 0 -->
<!--<relays-on-demand>1</relays-on-demand>-->
<!--
<relay>
<server>127.0.0.1</server>
<port>8080</port>
<mount>/example.ogg</mount>
<local-mount>/different.ogg</local-mount>
<on-demand>0</on-demand>
<relay-shoutcast-metadata>0</relay-shoutcast-metadata>
</relay>
-->
<!-- Mountpoints
Only define <mount> sections if you want to use advanced options,
like alternative usernames or passwords
-->
<!-- Default settings for all mounts that don't have a specific <mount type="normal">.
-->
<!--
<mount type="default">
<public>0</public>
<intro>/server-wide-intro.ogg</intro>
<max-listener-duration>3600</max-listener-duration>
<authentication type="url">
<option name="mount_add" value="http://auth.example.org/stream_start.php"/>
</authentication>
<http-headers>
<header name="foo" value="bar" />
</http-headers>
</mount>
-->
<!-- Normal mounts -->
<mount type="normal">
<mount-name>/stream.ogg</mount-name>
<username>admin</username>
<password>4getme!</password>
<max-listeners>1</max-listeners>
<dump-file>/tmp/dump-example1.ogg</dump-file>
<burst-size>65536</burst-size>
<fallback-mount>/example2.ogg</fallback-mount>
<fallback-override>1</fallback-override>
<fallback-when-full>1</fallback-when-full>
<intro>/example_intro.ogg</intro>
<hidden>1</hidden>
<public>1</public>
<authentication type="htpasswd">
<option name="filename" value="myauth"/>
<option name="allow_duplicate_users" value="0"/>
</authentication>
<http-headers>
<header name="Access-Control-Allow-Origin" value="http://webplayer.example.org" />
<header name="baz" value="quux" />
</http-headers>
<on-connect>/home/icecast/bin/stream-start</on-connect>
<on-disconnect>/home/icecast/bin/stream-stop</on-disconnect>
</mount>
<!--
<mount type="normal">
<mount-name>/auth_example.ogg</mount-name>
<authentication type="url">
<option name="mount_add" value="http://myauthserver.net/notify_mount.php"/>
<option name="mount_remove" value="http://myauthserver.net/notify_mount.php"/>
<option name="listener_add" value="http://myauthserver.net/notify_listener.php"/>
<option name="listener_remove" value="http://myauthserver.net/notify_listener.php"/>
<option name="headers" value="x-pragma,x-token"/>
<option name="header_prefix" value="ClientHeader."/>
</authentication>
</mount>
-->
<fileserve>1</fileserve>
<paths>
<!-- basedir is only used if chroot is enabled -->
<basedir>/usr/share/icecast2</basedir>
<!-- Note that if <chroot> is turned on below, these paths must both
be relative to the new root, not the original root -->
<logdir>/var/log/icecast2</logdir>
<webroot>/usr/share/icecast2/web</webroot>
<adminroot>/usr/share/icecast2/admin</adminroot>
<pidfile>/usr/share/icecast2/icecast.pid</pidfile>
<!-- Aliases: treat requests for 'source' path as being for 'dest' path
May be made specific to a port or bound address using the "port"
and "bind-address" attributes.
-->
<!--
<alias source="/foo" destination="/bar"/>
-->
<!-- Aliases: can also be used for simple redirections as well,
this example will redirect all requests for http://server:port/ to
the status page
-->
<alias source="/" destination="/status.xsl"/>
<!-- The certificate file needs to contain both public and private part.
Both should be PEM encoded.
<ssl-certificate>/usr/share/icecast2/icecast.pem</ssl-certificate>
-->
</paths>
<logging>
<accesslog>access.log</accesslog>
<errorlog>error.log</errorlog>
<!-- <playlistlog>playlist.log</playlistlog> -->
<loglevel>3</loglevel> <!-- 4 Debug, 3 Info, 2 Warn, 1 Error -->
<logsize>10000</logsize> <!-- Max size of a logfile -->
<!-- If logarchive is enabled (1), then when logsize is reached
the logfile will be moved to [error|access|playlist].log.DATESTAMP,
otherwise it will be moved to [error|access|playlist].log.old.
Default is non-archive mode (i.e. overwrite)
-->
<!-- <logarchive>1</logarchive> -->
</logging>
<security>
<chroot>0</chroot>
<!-- <changeowner>
<user>icecast2</user>
<group>icecast</group>
</changeowner> -->
</security>
</icecast>
how can I mount with different streaming file or different playlists?

I have done multiple mounting on the icecast. For this I have create multiple instances of mpd on the server.
And start each mpd separated on different ports. Port added in configuration file.
For example:
mpd /etc/mpd.conf # mpd's configuration file path for start mpd
After that I have started new YMPD instance on new port as mpd's client.
For example:
./ympd --port 6600 --webport 8085 # 6600 is mpd's port and 8085 is ympd's port.
Now each mpd have different playlist and different mount url for streaming audio.

Default Config
In the default config you have a few lines that look like this:
<listen-socket>
<port>8000</port>
<bind-address>0.0.0.0</bind-address>
<shoutcast-mount>/stream</shoutcast-mount>
</listen-socket>
You'll also see some commented out lines like this:
<!-- Normal mounts -->
<!--
<mount type="normal">
<mount-name>/example-complex.ogg</mount-name>
...
</mount>
-->
Multiple Mounts
If you have multiple source clients, you can publish to multiple streams by adding "normal" (non-default) mounts:
<listen-socket>
<port>8000</port>
<bind-address>0.0.0.0</bind-address>
<shoutcast-mount>/stream</shoutcast-mount>
</listen-socket>
<mount type="normal">
<mount-name>/stream-128k</mount-name>
</mount>
<mount type="normal">
<mount-name>/stream-64k</mount-name>
</mount>
I believe that strict shoutcast only supports a single mount per each bound port - it has no "paths".
However, you can have as many icecast mounts as you want, provided that you have source clients to publish to them.
liquidsoap can make it easy to publish multiple bitrates at once.

HAProxy 1.6+: rewrite host based on path

I'm trying to redirect all requested of type:
static.domain.com/site1/resource.jpg
static.domain.com/site1/resource2.js
static.domain.com/site2/resource3.gif
static.domain.com/site2/someDir/resource4.txt
to
site1.domain.com/resource.jpg
site1.domain.com/resource2.js
site2.domain.com/resource3.gif
site2.domain.com/someDir/resource4.txt
Basically, if the host is static.domain.com:
New subdomain is based on the the first part of the original path, with same TLD
New path is the original path not including the first part
I am pretty sure regexps can solve this, just not sure how to modify one header based on another..

At first I thought this might work:
# Detect hosts of the format static.*
acl host_static hdr_beg(host) -i static.
# Style using reqirep
# -------------
# Replace "static.domain.com" with "someFolder.domain.com" if the host is static.* and the path has at least two / symbols
# This causes: static.domain.com ===> whatever3.domain.com
#reqirep ^([^\ :]*\ /)([^/]+)(/.*\n)(^(?:[a-zA-Z0-9()\-=\*\.\?;,+\/&_]+:\ .+\n)+)*Host:\ static\.([^/]+?)$ \1\2\3\4Host:\ \2.\5 if host_static
#
# Replace "/someFolder/" with "/" at the beginning of any request path, if the host is static.*
# This causes: /whatever3/another/long/path ===> /another/long/path
#reqirep ^([^\ :]*)\ /[^/]+/(.*) \1\ /\2 if host_static
#---------------
but it doesn't work as expected. The regexp works properly in controlled tests, but not in haproxy itself. Probably an issue of directive processing and execution order. (perhaps the modification of the request path screws the first regexp?)
I then tried this:
# Style using set-var, set-path etc
#---------------
#http-request set-var(req.first_path_part) path,field(2,/) if host_static
#http-request set-var(req.last_host_part) hdr(host),regsub(^static\.,) if host_static
#http-request replace-header Host .* %[var(req.first_path_part)].%[var(req.last_host_part)] if host_static
#http-request set-path %[path,regsub(^/.*?/,/)] if host_static
#---------------
Once again, it almost works, but for some reason the host doesn't get replaced properly.
Since this was only used by the QA env, and the behaviour is different from Production anyways (static.*, in my case, would point to a CDN), I decided this is a sufficient solution for now:
# New style, using set-var and redirection.
#---------------
http-request set-var(req.first_path_part) path,field(2,/) if host_static
http-request set-var(req.last_host_part) hdr(host),regsub(^static\.,) if host_static
http-request redirect location https://%[var(req.first_path_part)].%[var(req.last_host_part)]%[path,regsub(^/.*?/,/)] code 302 if host_static
#---------------

I'm not sure how HAProxy works, but I can help you with the regex.
Try: ^static\.([^/]+)/([^/]+)/(.*)$
Your new URL will be \2.\1/\3.
Note that you may need to escape the /s in the regex (which would make it \/).

Get thrown "Generator: 0 records selected for fetching" when trying to crawl a small majority of websites using Nutch

I have a site that runs using moderngov.co.uk (you send them a template, which they then upload). I'm trying to crawl this site so it can indexed by Solr and searched through a drupal site. I can crawl the vast majority of websites out there, but for some reason I am unable to crawl this one: http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1
The specific error I get is this:
Injector: starting at 2013-10-17 13:32:47
Injector: crawlDb: X-X/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-10-17 13:32:50, elapsed: 00:00:02
Thu, Oct 17, 2013 1:32:50 PM : Iteration 1 of 2
Generating a new segment
Generator: starting at 2013-10-17 13:32:51
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
I'm not sure if it's got something to do with the regex patterns Nutch uses to parse html, or if there's a redirect that's causing issues, or something else entirely. Below are a few of the nutch config files:
Here are the urlfilters: http://pastebin.com/ZqeZUJa1
sysinfo:
Windows 7 (64-bit)
Solr 3.6.2
Apache Nutch 1.7
If anyone has come across this problem before, or might know why this is happening, any help would be greatly appreciated.
Thanks

I tried that seed url and I got this error:
Denied by robots.txt: http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1
Looking at the robots.txt file of that site:
# Disallow all webbot searching
User-agent: *
Disallow: /
You have to set a specific user agent in Nutch and modify the website to accept crawling form your user agent.
The property to change in Nutch is in conf/nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>nutch</value>
</property>

try this
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>10</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<!-- for now always re-fetch everything -->
<value>100</value>
<description>The maximum number of seconds between re-fetches of a page
(less than one day). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js