Localizing GitHub repo? - regex

I'm not sure if this is the right place to ask this question, but I hope someone can point me in the right direction. My question is about a GitHub repo that is written in another language (one I don't understand at all). Is there a standard way of translating/localizing an entire codebase/repo? I have some ideas for ways to do it, though they'd be tedious, so I'm looking for the correct or best way.
Fortunately the code itself seems to be mostly in English, but all of the comments and output strings are in Cyrillic (I'm not sure if it's Russian, Serbian, etc.). Which approach should I take to make this as painless as possible? Is regex the answer, or something like copy/paste (ugh!) through Google Translate? Thanks in advance.
Code Sample:
if (!defined("FILE")) die("Ошибка доступа.");
$data_filename = "data.txt"; // имя файла с данными сниффера
...
Google Translate:
if (defined ("FILE")) die ("Error Access.");
data_filename $ = "data.txt"; // Filename with data sniffer
...
While far from perfect, it does translate - but screws with the code.

Here's an idea that works for the example you posted (though definitely not for every language in the world):
import sys

for line in sys.stdin:
    foreignStart = None  # index where the current non-ASCII run began
    for ii, ch in enumerate(line):
        if foreignStart is None and ord(ch) > 127:
            foreignStart = ii
        elif foreignStart is not None and ord(ch) <= 127 and ch != ' ':
            # The run ended on an ASCII character that is not a space.
            sys.stdout.write("TRANSLATE(")
            sys.stdout.write(line[foreignStart:ii])
            sys.stdout.write(")")
            foreignStart = None
        if ord(ch) <= 127 and not (foreignStart is not None and ch == ' '):
            sys.stdout.write(ch)
This will print:
if (!defined("FILE")) die("TRANSLATE(Ошибка доступа).");
$data_filename = "data.txt"; // TRANSLATE(имя файла с данными сниффера)
...
I think you can see where you'll need to insert calls to a machine translation API (Google or Yandex...). You should of course cache results you've already translated in a text file somewhere, which will also allow you to override results you don't like from the API.
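The caching idea could look roughly like the sketch below. It is only an illustration: the JSON cache file, the helper names, and fake_translate (a stand-in for whichever translation API you end up calling) are all assumptions, not part of any real service. It also finds foreign runs with a regular expression rather than the character loop above.

```python
import json
import re
from pathlib import Path

# Runs of non-ASCII characters, allowing single spaces between the words.
FOREIGN_RUN = re.compile(r"[^\x00-\x7f]+(?: [^\x00-\x7f]+)*")


def load_cache(path):
    """Load previously translated phrases from a JSON file, if it exists."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def save_cache(path, cache):
    """Write the cache back so results can be reviewed and edited by hand."""
    Path(path).write_text(
        json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def translate_line(line, cache, translate):
    """Replace each foreign run, calling `translate` only on a cache miss."""
    def replace(match):
        phrase = match.group(0)
        if phrase not in cache:
            cache[phrase] = translate(phrase)
        return cache[phrase]
    return FOREIGN_RUN.sub(replace, line)


def fake_translate(phrase):
    """Hypothetical stand-in for a real machine translation API call."""
    return "TRANSLATE(" + phrase + ")"
```

Because the cache is a plain JSON file, overriding a translation you don't like is just a matter of editing the value for that phrase before the next run.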

Related

Google Text to Speech – words not read after break

I'm trying to get Google TTS to read aloud a short set of words and pause between each word. An example of the kind of SSML I send to the Google Cloud:
<speak>chaume<break time="3s"/> cuivré, relatif au cuivre</speak>
The first word gets read, then the voice pauses for three seconds, but everything that comes after gets dropped. I have successfully had TTS read longer sentences that contained breaks, such as this one, with identical code:
<speak>Se pure vagolavano allora per una Parma stupenda, prima dello <break time="3s"/>scempio della Bassa dei Magnani orrendamente ricostruita.</speak>
There does not seem to be any difference between the two samples, so what goes wrong with the first one?
My very slightly customized version of the synthesizing function is the following:
from google.cloud import texttospeech

# `credentials` and `home` are defined elsewhere in the script.
def synthesize_text(ssml_text, file_name, tts_lang, tts_voice_name):
    """Synthesizes speech from the input string of text."""
    client = texttospeech.TextToSpeechClient(credentials=credentials)
    input_text = texttospeech.SynthesisInput(ssml=ssml_text)
    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code=tts_lang,
        name=tts_voice_name,
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )
    # The response's audio_content is binary.
    with open(f"{home}/Documents/{file_name}.mp3", "wb") as out:
        out.write(response.audio_content)
It turned out to be enough to delete the following lines:
        name=tts_voice_name,
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE
The voice name was fr-FR-Standard-A, an fr-FR voice, whereas the language code was fr-CA. I'm fairly sure this mismatch caused the strange behaviour.
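A small guard could catch this kind of mismatch before the request is sent. This is only a sketch: voice_matches_language is a hypothetical helper, and it assumes voice names begin with their language code, as fr-FR-Standard-A does.

```python
def voice_matches_language(voice_name, language_code):
    """Return True if the voice name starts with the requested language code.

    Assumes names follow the "<language>-<REGION>-..." pattern seen in
    "fr-FR-Standard-A"; this is an assumption, not a documented guarantee.
    """
    prefix = "-".join(voice_name.split("-")[:2])
    return prefix.lower() == language_code.lower()
```

With the values from the question, voice_matches_language("fr-FR-Standard-A", "fr-CA") is False, which would have flagged the problem before calling synthesize_speech.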

Suppress warnings from CPD for C/C++ code

We are using PMD Copy Paste Detector (CPD) to analyze our C and C++ code.
However, there are a few parts of the code that are very similar, but with a good reason and we would like to suppress the warnings for these parts.
The documentation of PMD CPD only mentions annotations, but these will not work for our languages.
How can I still ignore warnings for specific parts?
Is there a comment to do so perhaps?
[UPDATE] I'm using the following Groovy script to run CPD:
@GrabResolver(name = 'jcenter', root = 'https://jcenter.bintray.com/')
@Grab('net.sourceforge.pmd:pmd-core:5.4.+')
@Grab('net.sourceforge.pmd:pmd-cpp:5.4.+')
import net.sourceforge.pmd.cpd.CPD
import net.sourceforge.pmd.cpd.CPDConfiguration
import java.util.regex.Pattern

def tokens = 60
def scanDirs = ['./path/to/scan', './scan/this/too']
def ignores = [
    './ignore/this/path',
    './this/must/be/ignored/too'
].collect({ it.replace('/', File.separator) })
def rootDir = new File('.')
def outputDir = new File('./reports/analysis/')
def filename_date_format = 'yyyyMMdd'
def encoding = System.getProperty('file.encoding')
def language_converter = new CPDConfiguration.LanguageConverter()

def config = new CPDConfiguration()
config.language = new CPDConfiguration.LanguageConverter().convert('c')
config.minimumTileSize = tokens
config.renderer = config.getRendererFromString 'xml', 'UTF-8'
config.skipBlocksPattern = '//DUPSTOP|//DUPSTART'
config.skipLexicalErrors = true

def cpd = new CPD(config)
scanDirs.each { path ->
    def dir = new File(path);
    dir.eachFileRecurse(groovy.io.FileType.FILES) {
        // Ignore file?
        def doIgnore = false
        ignores.each { ignore ->
            if(it.path.startsWith(ignore)) {
                doIgnore = true
            }
        }
        if(doIgnore) {
            return
        }
        // Other checks
        def lowerCaseName = it.name.toLowerCase()
        if(lowerCaseName.endsWith('.c') || lowerCaseName.endsWith('.cpp') || lowerCaseName.endsWith('.h')) {
            cpd.add it
        }
    }
}
cpd.go();

def duplicationFound = cpd.matches.hasNext()
def now = new Date().format(filename_date_format)
def outputFile = new File(outputDir.canonicalFile, "cpd_report_${now}.xml")
println "Saving report to ${outputFile.absolutePath}"
def absoluteRootDir = rootDir.canonicalPath
if(absoluteRootDir[-1] != File.separator) {
    absoluteRootDir += File.separator
}
outputFile.parentFile.mkdirs()
def xmlOutput = config.renderer.render(cpd.matches);
if(duplicationFound) {
    def filePattern = "(<file\\s+line=\"\\d+\"\\s+path=\")${Pattern.quote(absoluteRootDir)}([^\"]+\"\\s*/>)"
    xmlOutput = xmlOutput.replaceAll(filePattern, '$1$2')
} else {
    println 'No duplication found.'
}
outputFile.write xmlOutput
You can define your custom markers for excluding certain blocks from analysis through the --skip-blocks-pattern option.
--skip-blocks-pattern Pattern to find the blocks to skip. Start and End pattern separated by |. Default is #if 0|#endif.
For example the following will ignore blocks between /* SUPPRESS CPD START */ and /* SUPPRESS CPD END */ comments (the comment must occupy a separate line):
$ ./run.sh cpd --minimum-tokens 100 --files /path/to/c/source --language cpp --skip-blocks-pattern '/* SUPPRESS CPD START */|/* SUPPRESS CPD END */'
Note, however, that this will cause the tool to perform copy-paste detection inside code delimited by #if 0/#endif.
After searching through the code of PMD on GitHub, I think I can safely say that this is NOT supported at this point in time (current version being PMD 5.5.0).
A search for CPD-START in their repository does not show any results within the pmd-cpp directory (see the search results on GitHub).
I know this is a ~3-year-old question, but for completeness: CPD started supporting this in PMD 5.6.0 (April 2017) for Java, and since 6.3.0 (April 2018) it has been extended to many other languages, including C/C++. Nowadays, almost all CPD-supported languages allow for comment-based suppressions.
The complete (current) docs for comment-based suppression are available at https://pmd.github.io/pmd-6.13.0/pmd_userdocs_cpd.html#suppression
It's worth noting that if a file has a // CPD-OFF comment but no matching // CPD-ON, everything will be ignored until the end of the file.
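To make the end-of-file behaviour concrete, here is a toy model of it in Python. This is not how CPD is implemented, and visible_lines is a made-up helper; it only mimics the rule described above by dropping everything between a CPD-OFF marker and the next CPD-ON marker.

```python
def visible_lines(source):
    """Return the lines that remain once CPD-OFF / CPD-ON regions are dropped."""
    kept = []
    suppressed = False
    for line in source.splitlines():
        if "CPD-OFF" in line:
            suppressed = True   # start ignoring from this marker on
        elif "CPD-ON" in line:
            suppressed = False  # resume normal analysis
        elif not suppressed:
            kept.append(line)
    return kept
```

For a file containing int a; then // CPD-OFF, int b;, // CPD-ON, int c;, a second // CPD-OFF and finally int d;, only int a; and int c; remain, because the last marker is never closed.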
I don't have any help for CPD. In general, I know about such tools; I don't understand the bit about "warnings".
Our CloneDR tool finds exact and near-miss duplicate code. IMHO, it finds better clones than CPD, because it uses the language syntax/structure as a guide. [This is backed up by a research report done by a third party that you can find at the site.] And it does not issue "warnings".
If there is code that it thinks is involved in a clone, the tool will generate an output report page for the clones involved. But that isn't a warning. There is no way to suppress the reporting behavior. Obviously, if you have seen such a clone and decide it is not interesting, you can mark one of the clone entries with a comment stating that it is an uninteresting clone; that comment will show up in the clone report. Such comments have no impact whatsoever on which clones are detected by CloneDR, so adding them does not change the computed answer.

Regex'ed string not always used

I am currently rewriting some custom perl-code used in the OTRS-ticketing-system, which is used to create SQL-like queries. Yes, there are probably better ways of escaping input, but let's not go into this...
$Param{PostMasterSearch} contains an email address like test'test@domain.tld (note the ').
my $PostMasterSearch = $Param{PostMasterSearch};
$PostMasterSearch =~ s/'//gms;
$Self->{LogObject}->Log(
    Priority => 'error',
    Message => "XXXXX: $PostMasterSearch",
);
$SQLExt .= " $Field LIKE '$PostMasterSearch'";
So my expectation would be that I'll find a log message saying XXXXX: testtest@domain.tld and part of an SQL query that goes like Email LIKE 'testtest@domain.tld'.
But in reality, I only get the log message; the SQL query string is, for whatever reason, Email LIKE 'test'test@domain.tld'.
Changing the last line of the code to
$SQLExt .= " $Field LIKE '$PostMasterSearch' X";
doesn't make any sense, but it returns the string Email LIKE 'testtest@domain.tld' X.
Any hints on why $PostMasterSearch is still containing that ' that should have been long gone? Or a hint on how to concatenate the $SQLExt with the '-less version of $PostMasterSearch?
OK, now this is the part where it gets embarrassing...
It turns out that this script contains the very same code multiple times and executes it multiple times...
So the code posted above actually works and doesn't return any error (as it should). The error message is caused by the next occurrence of the code, which hasn't been patched yet.
tl;dr: I probably should trash that script and rewrite it from scratch.

Using regular expressions in python to determine C++ functions and their parameters

So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong.
I want a script to go through a file, find all the function definitions, and then pull out the name, return type, and parameters of the function, and output a "doxygen" style comment like this:
/******************************************************************************/
/*!
\brief
Main function for the file
\return
The exit code for the program
*/
/******************************************************************************/
But I'm doing something wrong with the regular expression in trying to parse the parameters... Here is the script so far:
import re
import sys

f = open(sys.argv[1])
functions = []
for line in f:
    match = re.search(r'([\w]+)\s+([\S]+)\(([\w+\s+\w+])+\)',line)
    if line.find("\\fn") < 0:
        if match:
            returntype = match.group(1)
            funcname = match.group(2)
            print '/********************************************************************'
            print " \\fn " + match.group()
            print ''
            print ' \\brief'
            print ' Function description for ' + funcname
            print ''
            if len(match.groups()) > 2:
                params = []
                count = len(match.groups()) - 2
                while count > 0:
                    matchingstring = match.group(count + 2)
                    if matchingstring.find("void") < 0:
                        params.append(matchingstring)
                    count -= 1
                for parameter in params:
                    print " \\param " + parameter
                    print ' Description of ' + parameter
                    print ''
            print ' \\return'
            print ' ' + returntype
            print '********************************************************************/'
            print ''
Any help would be appreciated. Thanks
The grammar of C++ is far too complex to be handled by simple
regular expressions. You'll need at least a minimal parser.
I've found that for restricted cases, where I'm not concerned
with C++ in general, but only my own style, I can often get away
with a flex based tokenizer and a simple state machine. This
will fail in many cases of legal C++: for starters, of
course, if someone uses the pre-processor to modify the syntax;
but also because < can have different meanings, depending on
whether what precedes it names a template or not. But it's often
adequate for a specific job.
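As a rough illustration of that "restricted cases" approach in Python (not the flex-based tokenizer described above, just a simplified stand-in), the sketch below uses a regular expression to spot a candidate signature and a small depth counter to split the parameters. SIGNATURE, split_params and parse_signature are made-up names, and the sketch assumes one declaration per line, with no macros, function pointers or operator overloads. It will also accept some non-declarations such as return foo(1);.

```python
import re

# Deliberately restricted: return type, then a name, then (...) on one line.
SIGNATURE = re.compile(
    r"^\s*(?P<ret>[\w:<>,\s\*&]+?)\s+(?P<name>[\w:~]+)\s*\((?P<params>.*)\)"
)


def split_params(text):
    """Split on commas that are not nested inside <...> or (...)."""
    params, depth, current = [], 0, []
    for ch in text:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch == "," and depth == 0:
            params.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    last = "".join(current).strip()
    if last and last != "void":
        params.append(last)
    return params


def parse_signature(line):
    """Return (return_type, name, params), or None if nothing matches."""
    m = SIGNATURE.match(line)
    if not m:
        return None
    return m.group("ret").strip(), m.group("name"), split_params(m.group("params"))
```

The depth counter is what the original regular expression was missing: a comma inside std::map<std::string, ...> must not end a parameter. It still treats every < as an opening bracket, which is exactly the ambiguity mentioned above.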
I've used a PEG parser with great success when trying to do simple format parsing. pyPeg is a very simple implementation of such a parser written in Python.
Example Python code for C++ function parser:
EDIT: Address template parameters. Tested with input from SK-logic and output is correct.
import pyPEG
from pyPEG import parseLine
import re
def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")" # -1 -> zero or more repetitions.
sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)
When this is executed results is:
([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')
C++ cannot really be parsed by a (sane) regular expression: they are a nightmare as soon as nesting is concerned.
There is another concern too, determining when to parse and when not to. A function may be declared:
at file scope
in a namespace
in a class
And the last two can be nested at arbitrary depths.
I would propose using Clang here. It's a real C++ front-end with a full-featured parser, and there are:
a C API, with (notably) an API to the Indexing Library
Python bindings on top of the C API
The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.
That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. And redundancy is at best useless and at worst dangerous: it introduces the potential threat of desynchronization...
If the function is tricky enough that its use requires documentation, then a developer who knows its limitations and pitfalls has to write this documentation.

I am new to Ruby and I need to understand 3 functions

I have been given the 3 functions below. Can anybody please help me to understand these? I am trying to port an application to C++ using Qt, but I don't understand these functions. So please help me!
Thanks in advance.
function 1:
def read_key
  puts "read pemkey: \"#{@pkey}\"" if @verbose
  File.open(@pkey, 'rb') do |io|
    @key = OpenSSL::PKey::RSA.new(io)
  end
end
function 2:
def generate_key
  puts "generate pemkey to \"#{@pkey_o}\"" if @verbose
  @key = OpenSSL::PKey::RSA.generate(KEY_SIZE)
  # save key
  File.open(@pkey_o, 'wb') do |file|
    file << @key.export()
  end
end
function 3:
def sign_zip
  puts "sign zip" if @verbose
  plain = nil
  File.open(@zip, 'rb') do |file|
    plain = file.read
  end
  @sig = @key.sign(OpenSSL::Digest::SHA1.new, plain)
end
There are probably two things about the above code that are confusing you, which if clarified, will help understand it.
First, @verbose and @key are instance variables, what a C++ programmer might call "member variables." The "if @verbose" following the puts statement literally means only do the puts if @verbose is true. @verbose never needs to be declared a bool; you just start using it. If it's never initialized, it's "nil", which evaluates to false.
Second, the do/end parts are code blocks. Many Ruby methods take a code block and execute it with a variable declared in those pipe characters. An example would be "array.each do |s| puts s; end" which might look like "for(int i = 0; i < array.size(); ++i) { s = array[i]; puts(s); }" in C++. For File.open, |io| is the file instance opened, and "read" is one of its methods.
These are all methods. #{@pkey_o} is string interpolation, substituting in the contents of an instance variable (called pkey_o; Ruby instance variables begin with @ and class variables, unused here, begin with @@).
File.open(@pkey, 'rb') do |io|
  @key = OpenSSL::PKey::RSA.new(io)
end
That opens the file whose name is stored in @pkey, stores the file handle in io (a block-local variable) and uses that with OpenSSL::PKey::RSA.new, whose result is stored in @key. Finally, it closes the file handle when the block finishes (at the end), whether the block exits successfully or with an error (in which case the exception still propagates after the file has been closed). When translating this to C++, use of the RAII pattern is entirely reasonable (if you were going to Java, I'd say to use try/finally).