I wanted to contribute something I hope you find useful - a script which exploits automatic translation to find synonyms, then generates spun articles and provides spintax with embedding. As a noob, I can't post links, but I wanted to share the source anyway if you want to adapt or integrate it (disclaimer: I am not to be held responsible for my messy code).
HOW TO USE
- Install Python 2.7 if you don't have it.
- Install goslate (google it), which lets you use the Google Translate API for free in Python. I am not affiliated with its author. (If you want to confirm it installed correctly, see the quick sanity check after these steps.)
- Copy the entire source into Notepad/Notepad++ and save it as mt_spin.py.
- Save whatever content you'd like to spin in a single .txt file in the same folder (Unicode supported). Preferably name it file_to_spin.txt (see below).
- Run the script from the command line with the following command and parameters:
python mt_spin.py <file path> <languages> <number of copies>
where file path = the path of the text file to spin, languages = the languages used for back-translation to predict synonymous phrases (language codes separated by commas), and number of copies = the number of random spun versions that will be generated in a separate file, e.g.
python mt_spin.py file_to_spin.txt nl,sv,es,it 5
FOR WINDOWS USERS: the script, goslate, and the file should all be located in your Python folder, and you'll need to cd into that folder first to run the script. Alternatively, if you have a file called file_to_spin.txt in the same folder as the script, double-clicking the script will run it with a default set of languages, provided you have Python and goslate installed.
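Here's the sanity check mentioned above - a few lines you can paste into the Python console to confirm goslate works before running the script. This uses goslate's standard Goslate().translate() calls, the same ones the script makes internally; 'de' is just an arbitrary test language:

import goslate

gs = goslate.Goslate()
# round-trip a throwaway phrase to confirm the API responds
print gs.translate('hello world', 'de')
print gs.translate(gs.translate('hello world', 'de'), 'en')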
What languages should you pick for back-translation? Preferably ones that are widely used and have a syntax similar to English (Germanic and Romance languages should work quite well). I recommend some combination of Dutch (nl), Swedish (sv), French (fr), Italian (it), Spanish (es), Portuguese (pt), Turkish (tr), Indonesian (id) and Arabic (ar). More languages add variation but slow the program down and may produce less grammatical results. If you want speed at the cost of variation, go for a single language - Spanish (es) and Italian (it) both work quite well.
If you get bad results, simply experiment with a different set of languages for back-translation.
WHAT IT DOES
While running, the program prints plenty of output giving an approximate idea of what it does: every sentence gets translated into the selected languages and back, then the back-translations are aligned with the original to find alternative ways to express the same thing in the same context. The resulting possible "paths" a sentence can take are represented by a simplified Finite State Automaton, which is then used for spinning. The approach is loosely inspired by NLP work on finding synonyms through translation databases, and yields more accurate and varied results than simply translating sentences back and forth between target languages.
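To make the FSA idea concrete, here's a hand-made toy version of the structure the script builds (the sentence, the alternate phrase, and the node numbers are invented for illustration - in the actual script they come out of the alignment step):

import random

# toy pseudo-FSA for "this game is really great":
# keys are nodes, values are sets of (phrase_tuple, target_node) transitions
fsa = {
    0: set([(("this",), 1)]),
    1: set([(("game",), 2)]),
    2: set([(("is",), 3), (("is", "truly"), 4)]),  # the alternate path skips node 3
    3: set([(("really",), 4)]),
    4: set([(("great",), 5)]),
}

# naive random walk, the same idea as generate_sent() in the source below
words, node = [], 0
while node in fsa and fsa[node]:
    label, node = random.choice(list(fsa[node]))
    words.extend(label)
print " ".join(words)  # "this game is really great" or "this game is truly great"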
Assuming the script runs fine, it should produce three files in the end:
- transspin_output.txt - Randomly spun text derived directly from the sentence model. The length of the file depends on how many copies you request (the third parameter). All the spun documents are dumped into this file, separated by headers (no header is written if you only request one document).
- transspin_synonyms.txt - A list of predicted synonyms found while the code was running. I thought this could come in handy even if you want to spin stuff your own way, so I included it.
- transspin_spintax.txt - Spintax representation of the sentence model, featuring lots of embedding and weird spacing.
I DO NOT GUARANTEE the spintax (the third file) will work - it should in theory, but that's one of the things I'd like to get feedback on (I couldn't find a proper checker for embedded spintax). Finite State Automata and spintax are not exactly interchangeable, but I am not sure problematic cases ever arise in practice.
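Here's a minimal sketch of the kind of checker I had in mind - a naive expander for embedded spintax that resolves the innermost {a|b|c} group over and over. The spin_once() name and the approach are mine, not part of the script; if nesting in transspin_spintax.txt is broken, stray braces survive in the output and are easy to spot:

import random, re

def spin_once(spintax):
    """ repeatedly resolve the innermost {a|b|c} group until no braces remain """
    pattern = re.compile(r'{([^{}]*)}')
    while True:
        match = pattern.search(spintax)
        if not match:
            return spintax
        choice = random.choice(match.group(1).split('|'))
        spintax = spintax[:match.start()] + choice + spintax[match.end():]

print spin_once("It {'s really hard |{truly |really }is difficult }to praise this game.")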
Obviously, hit me back with any installation issues, bug reports, comments and suggestions you have. Enjoy.
SOURCE
""" jespersen 2013 automatic machine-translation-API-based multi-word embedded spin generator v. 1.0 """ import goslate import sys, re, random, codecs from collections import defaultdict """ the script uses goslate as a free Google translate API (I'm not affiliated with its author in any way) """ def split_sentences(content): """ splits text into a list of sentences not 100% accurate (has problems in the .001% of cases where punctuation appears inside parentheses), but actually better than most free sentence splitters out there """ """ list of abbreviations in text which may appear before a period, replaces them with a special string to transform them back again """ abbrevs = ('mr', 'mrs', 'dr', 'inc', 'co', 'vs', 'ex', 'e.g', 'i.e', 'ps', 'p.s', 'no') for abbrev in abbrevs: for char in (" ", "\("): content = re.sub(char+abbrev+r'\.', char[-1]+abbrev+'XABBRDOTX', content) content = re.sub(char+abbrev.capitalize()+r'\.', char[-1]+abbrev.capitalize()+'XYZDOTXYZ', content) """ actual regex """ matches = re.finditer(r'[^ ][^\?!]+?([\.|\?|!] ?)*[\.|\?|!][\)|\*|\"\']?( +[\)|\(])?\s', content+" ") last_stop = 0 sents = [] for match in matches: sent = match.string[match.start():match.end()] sent = re.sub('XABBRDOTX', '.', sent) sents.append(sent) last_stop = match.end() trailing = content[last_stop:] + " " if len(trailing.strip()) > 0: sents.append(trailing) return sents def zeros(shape): """ initializes a matrix - taken from numpy """ retval = [] for x in xrange(shape[0]): retval.append([]) for y in xrange(shape[1]): retval[-1].append(0) return retval def smith_waterman(seq1, seq2, insertion_penalty = -1, deletion_penalty = -1, mismatch_penalty = -1, match_score = 2): """ a simplified implementation of the Smith-Waterman algorithm for sentence alignment it's actually imperfect/buggy - initial unaligned elements in one of the sentences get sliced off - but it works for the current purpose (any other sequence alignment algorithm could work instead of this one, too) - adapted from Stack Overflow (post by Gareth Rees) """ DELETION, INSERTION, MATCH = range(3) m, n = len(seq1), len(seq2) p = zeros((m+1, n+1)) q = zeros((m+1, n+1)) for i in xrange(1, m+1): for j in xrange(1, n+1): deletion = (p[i-1][j] + deletion_penalty, DELETION) insertion = (p[i][j-1] + insertion_penalty, INSERTION) if seq1[i-1] == seq2[j-1]: match = (p[i-1][j-1] + match_score, MATCH) else: match = (p[i-1][j-1] + mismatch_penalty, MATCH) p[i][j], q[i][j] = max((0, 0), deletion, insertion, match) def backtrack(): i, j = m, n while i > 0 and j > 0: if q[i][j] == MATCH: i -= 1 j -= 1 yield seq1[i], seq2[j] elif q[i][j] == INSERTION: j -= 1 yield "", seq2[j] elif q[i][j] == DELETION: i -= 1 yield seq1[i], "" return [s[::-1] for s in zip(*backtrack())] def tokenize(sent): """ breaks the sentence down into words, also inserts an 'initial' dummy #i# for aligning sentence beginnings punctuation marks could also be separated at this point, but it causes more problems than it solves, so I don't do that """ return tuple(["#i#"] + sent.split(" ")) def contains(big, small): """ checks if a bigger list contains a smaller list BUT returns False when the lists are the same """ if big == small: return False for i in xrange(len(big) - len(small) + 1): for j in xrange(len(small)): if big[i+j] != small[j]: return False else: return True def construct_fsa(sent, back_translations, synonym_list=[]): """ constructs a pseudo-Finite State Automaton basically, the script: 1. 
aligns the original sentence with its automatic translation (for every translation) 2. finds 'alternate' ways to express a particular sequence when the preceding and following context is the same 3. initializes a dictionary representing the pseudo-FSA: the keys are states/nodes, the values are a list of transitions expressed as (label_string, target_node_int) initially, this corresponds to a transition between every pair of words in the original sentence, e.g. fsa[0] = [(("first", 1), 1)] <- from the beginning of the sentence (node 0), there is one path, adding the word "first" and going to node 1 4. the dictionary is expanded with every sequence of 'alternate' paths found in the translations (this is still in init_fsa) 5. the transitions found are shortened as long as the final word in the sequence corresponds to a transitional word in a state preceding the target, a shorter path is considered leading to that previous state instead (this is simplify_fsa) ...yeah, it's a mess, and there are probably more elegant ways of doing it, etc., etc. ultimately returns the FSA dictionary, where keys are states/nodes (int) and their values are sets of possible transitions represented as tuples (label_string, target_node_int) """ print "\n ORIGINAL:\n", sent.encode('utf-8') print "\n BACKTRANSLATIONS:" for i, bktransl in enumerate(back_translations): print i+1, " ".join(bktransl).encode('utf-8') print "" original = tokenize(sent) back_translations = set(back_translations) - set([original]) alignments = [] for sequence in back_translations: alignments.append(smith_waterman(original, sequence)) def original_nodes(original, sequence): original_node = {} old_i = 0 for i, word in enumerate(sequence): original_node[i] = old_i if word != "": old_i += 1 original_node[len(sequence)] = old_i return original_node def init_fsa(original, alignments): fsa = {i: [(tuple([label]), i+1)] for i, label in enumerate(original)} for s1, s2 in alignments: nd = original_nodes(original, s1) if sum([s1[i] == s2[i] for i in xrange(len(s1))]) > 0: start = 0 while s1[start] != s2[start]: start += 1 aligned = zip(s1, s2) else: """ if no elements can be aligned at all """ aligned = [] print "" for i in xrange(len(aligned)): to_print = "%-4i %-24s %-32s" % (i, aligned[i][0], aligned[i][1]) print to_print.encode('utf-8') print "" for i in xrange(start, len(aligned)): if aligned[i][0] != aligned[i][1] and (i == 0 or aligned[i-1][0] == aligned[i-1][1]): j = i while j < len(aligned) and aligned[j][0] != aligned[j][1]: j += 1 if nd[i] < nd[j]: """ because the nodes of the FSA currently correspond to transitions between words in the *original* sentence, words inserted in the back-translation which aren't aligned result in loops - this condition rules such cases out""" area = aligned[i:j] phrase1 = tuple([e for e in [w[0] for w in area] if e != ""]) phrase2 = tuple([e for e in [w[1] for w in area] if e != ""]) to_print = " %-36s = %-36s" % (" ".join(phrase1), " ".join(phrase2)) print to_print.encode('utf-8') """ let's save the synonyms somewhere while we're at it """ syn1 = " ".join([original[nd[i]-1]] + list(phrase1) + [original[nd[j]]]) syn2 = " ".join([original[nd[i]-1]] + list(phrase2) + [original[nd[j]]]) if (syn1, syn2) not in synonym_list: synonym_list.append((syn1, syn2)) fsa[nd[i]].append((phrase1, nd[j])) fsa[nd[i]].append((phrase2, nd[j])) return fsa def simplify_fsa(fsa): new_fsa = {} print "\n PRELIMINARY FSA:\n" for key, values in fsa.iteritems(): print key, list(fsa[key]) for key, values in fsa.iteritems(): values = 
set(values) new_values = [] to_remove = [] for i, transition in enumerate(values): if transition[1] < len(fsa.keys()) and len(transition[0]) > 0: t2 = transition while key != t2[1]-1 and (t2[0][-1],) in [val[0] for val in fsa[t2[1]-1]]: t2 = (t2[0][:-1], t2[1]-1) if len(t2[0]) == 0: break new_values.append(t2) else: new_values.append(transition) new_fsa[key] = set(new_values) print "\n FINAL FSA:\n" for key, values in new_fsa.iteritems(): print key, list(new_fsa[key]) return new_fsa return simplify_fsa(init_fsa(original, alignments)) def get_back_translations(sent, service, languages): """ returns a list of back-translations of the original sentence by translating it into a target language and back using a machine translation API I use goslate - a randomly found awesome free Google Translate API with some modifications, you can replace it with any automatic translation that can return unicode translations of the original sentences """ back_translations = [] for language in languages: transl = goslate.translate(sent, language) back_transl = tokenize(goslate.translate(transl, 'en')) back_translations.append(back_transl) return back_translations def generate_sent(fsa): """ picks random possible transitions between states until it reaches the end, stringing a sentence together in the process the generation is completely naive - it has no memory of paths taken before, so it's prone to repetitions, etc. """ words = [] target = 0 while target in fsa.keys() and len(fsa[target]) > 0: label, target = random.choice(list(fsa[target])) words += [w for w in label if len(w) > 0 and w != "#i#"] new_sent = " ".join(words) return new_sent def generate_spintax(fsa): """ attempts to generate 'spintax' (with lots of embedding) for the FSA representation of a sentence not seriously tested for quality - may potentially generate buggy representations because FSAs aren't translatable into spintax """ opening = defaultdict(list) closing = defaultdict(list) for key, values in fsa.iteritems(): transitions = sorted(values, key=lambda x: x[1], reverse=True) for label, goal in transitions: if label != ("#i#",): opening[key].append(label) closing[goal].insert(0, label) linear = [] for key in xrange(len(fsa.keys())+2): linear += [(t, 1) for t in closing[key]] + [(t, -1) for t in opening[key]] to_print = "" for i, item in enumerate(linear): if item[1] == -1: if linear[i+1] == (item[0], 1): to_print += " ".join(item[0]) + " " else: to_print += "{" + " ".join(item[0]) + " |" elif linear[i-1] != (item[0], -1): to_print += "}" """ simplify preliminary representation by turning instances of {{x|{y|z}} to {x|y|z} """ match = re.search(r'\|{[^{]+?}}', to_print) while match: to_print = to_print[:match.start()+1] + to_print[match.start()+2:match.end()-1] + to_print[match.end():] match = re.search(r'\|{[^{]+?}}', to_print) to_print = re.sub(r' +', ' ', to_print) return to_print class SentenceToSpin: """ initialize a class for every sentence to be spinned based on the original sentence, translation service, and languages used""" def __init__(self, sent, service, languages): self.original_text = sent self.back_translations = get_back_translations(self.original_text, service, languages) self.synonyms = [] self.fsa = construct_fsa(self.original_text, self.back_translations, self.synonyms) self.spintax = generate_spintax(self.fsa) def get_random_spin(self): return generate_sent(self.fsa) class DocumentToSpin: """ initialize a class representing the whole document, storing its structure and sentence representations based on file content, 
translations service, and languages used content = all plain text of the file""" def __init__(self, content, service, languages): self.content = content self.paragraphs = [] for par in self.content.splitlines(): sents = [SentenceToSpin(sent, service, languages) for sent in split_sentences(par)] self.paragraphs.append(sents) def main_function(source_document_path, service, languages, copies): """ MAIN SCRIPT TO RUN analyze document, create FSA representation for every sentence, then save synonyms, spun text, and spintax into separate files source_document_path = path of the document to use service = translation API (see get_back_translations) languages = languages to translate the English text into and then back into English copies = how many spun copies to generate in the output file """ copies = max(1, copies) copies = min(256, copies) output_synonyms_path = "C:/Python27/transspin_synonyms.txt" output_spun_text_path = "C:/Python27/transspin_output.txt" output_spintax_path = "C:/Python27/transspin_spintax.txt" print "\nAnalyzing phrasal synonyms in text...\n" with codecs.open(source_document_path, 'r', 'utf-8-sig') as file: document = DocumentToSpin(file.read(), goslate, languages) """ the 3 functions using DocumentToSpin below could be merged, but I kept them separate for readability/editability """ print "\nSaving synonymous phrases (%s)..." % output_synonyms_path with codecs.open(output_synonyms_path, 'w', 'utf-8') as synonym_file: for par in document.paragraphs: for sent in par: for pair in sent.synonyms: synonym_file.writelines(pair[0] + "\n" + pair[1] + "\n\n") print "Saving a spun version of the file (%s)..." % output_spun_text_path with codecs.open(output_spun_text_path, 'w', 'utf-8') as spin_file: if copies > 1: spin_file.writelines("%%% version no. 1 %%%\n\n") for i in xrange(copies): for par in document.paragraphs: line = " ".join([sent.get_random_spin() for sent in par]) spin_file.writelines(line + "\n") if i != copies-1: spin_file.writelines("\n%%% version no. " + str(i+1) + " %%%\n\n") print "Saving the uncertain spintax generated (%s)..." % output_spintax_path with codecs.open(output_spintax_path, 'w', 'utf-8') as spintax_file: for par in document.paragraphs: line = " ".join([sent.spintax for sent in par]) spintax_file.writelines(line + "\n") """ when the script is run from shell... """ if __name__ == "__main__": if len(sys.argv) == 4: goslate = goslate.Goslate() source_document_path = sys.argv[1] languages = sys.argv[2].split(",") copies = int(sys.argv[3]) main_function(source_document_path, goslate, languages, copies) elif len(sys.argv) == 1: goslate = goslate.Goslate() source_document_path = "C:/Python27/file_to_spin.txt" languages = ['nl', 'se', 'es', 'it'] copies = 5 main_function(source_document_path, goslate, languages, copies) print "\n(no arguments were specified, used default filename and language set)" else: print "\nWrong number of arguments!\nCorrect usage:\n" print "python mt_spin.py <file path> <lgs for back-translation> <copies to output>" print " e.g. python mt_spin.py file_to_spin.txt nl,se,es,it 5" sys.exit() EXAMPLE
It truly is difficult to over praise this game, and I can't conceive of many gaming fans being disappointed in this one.
--- using nl, sv, es, it --->
It {'s really hard |{truly |really }is difficult }to { |over }praise this game, and I {can not imagine that |{can't |can not }conceive of }many {gamers who were |{game enthusiasts to be |gaming fans {are |being }}}disappointed in {this. |this one. }