Casual Procgen Text Tools

Last Thursday I was at the PCG-meets-autotesting unconference at Falmouth, which organized itself into a bunch of work-groups to talk through ideas related to the conference theme. This was a really fun time, and I am grateful to the organizers and my fellow guests for making it so intriguing.

Our morning work-group started with a suggestion I had: what if there were a casual text-generation tool like Tracery, but that provided a similar level of help in assembling corpora for leaf-level node expansion? What would help new users learn about selecting and acquiring a corpus? What would help them refine to the point where they had something they were happy with using? (And for that matter, are there applications of this that we could see being useful to expert users as well? What could such a tool offer that is currently difficult to do?)
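
To make the “leaf-level node expansion” idea concrete: in a Tracery-style grammar, the corpora live in the leaf rules, and everything above them is structure. Here is a minimal stand-in for that expansion process (the grammar and vocabulary are invented for illustration, and this is a sketch, not Tracery’s actual implementation):

```python
import random

# Toy Tracery-style grammar: rules map symbols to lists of expansions.
# "adjective" and "animal" are the leaf-level corpora this post is about.
GRAMMAR = {
    "origin": ["The #adjective# #animal# waits."],
    "adjective": ["silver", "restless", "hollow"],
    "animal": ["heron", "fox", "moth"],
}

def expand(symbol, grammar, rng=random):
    """Recursively expand a symbol, replacing #name# references."""
    text = rng.choice(grammar[symbol])
    while "#" in text:
        start = text.index("#")
        end = text.index("#", start + 1)
        inner = text[start + 1:end]
        text = text[:start] + expand(inner, grammar, rng) + text[end + 1:]
    return text
```

The point of a corpus-assembly tool would be to help the user build and refine those leaf lists, which is where most of the flavor actually lives.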

This idea sprang from my own discovery that I spend a lot of my procgen development time simply on selecting and revising corpora. What will productively add to the feel and experience of a particular work, and what should be excluded? How small or large do the corpora need to be? Is there behavior that I can’t enforce at the grammar level and therefore have to implement through the nature of the corpus itself? (I talk a bit about those concerns during my PROCJAM talk (video, slides), especially under the Beeswax category.)

We had a great conversation with Gabriella Barros, Mike Cook, Adam Summerville, and Michael Mateas. The discussion ranged over a number of additional possibilities, some of which went considerably beyond the initial “naive user” brief here.

Existing corpus resources

We talked about where one can easily find corpora already, if it turned out that there was material available that could be usefully plugged into a tool.

Mentioned in this conversation: Darius Kazemi’s github corpora resource, containing lots of user-contributed corpora set up in JSON. DBpedia. ngrams as a source of common word pairings, or a way to find adjectives that are known to go with a particular noun/type of noun.

Scraping new corpora

What data sources are on the web that one could imagine building an auto-scraper for?

This is an area where Gabriella has a lot of experience, because much of her research is in games that make use of external data. (She spoke about those games the next day during the PROCJAM talks, which means that you can see her introduction to them on this Youtube video.)

Mike has an existing tool called Spritely that is designed to look for images on the web that are isolated enough to use, then convert them into a sprite-style format. We talked about whether something similar could be used for pulling in text materials with particular associations.

Mixed-initiative corpus-building

A mixed initiative tool is one in which the computer and the user both contribute to the creative output, sometimes building on one another’s work. (Here’s a great chapter about different approaches to mixed initiative by Antonios Liapis, Gillian Smith, and Noor Shaker, which outlines a lot of different possibilities.)

What would a mixed initiative tool look like for corpus generation? One possibility would be something where the user typed in some words and the system came back with a list of possibly related words that the user could then choose to add to the corpus or not.
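
A rough sketch of the suggestion side of that loop, using a tiny hand-made vector table in place of a real embedding model (all the words and vectors here are invented for illustration):

```python
import math

# Hypothetical miniature embedding table standing in for word2vec.
VECTORS = {
    "sword":   (0.9, 0.1, 0.0),
    "dagger":  (0.8, 0.2, 0.1),
    "shield":  (0.7, 0.3, 0.0),
    "pudding": (0.0, 0.1, 0.9),
    "custard": (0.1, 0.0, 0.8),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def suggest(seed_words, vectors, n=3):
    """Rank candidate words by similarity to the user's seed words."""
    seeds = [vectors[w] for w in seed_words]
    candidates = [w for w in vectors if w not in seed_words]
    scored = [(max(cosine(vectors[w], s) for s in seeds), w)
              for w in candidates]
    scored.sort(reverse=True)
    return [w for _, w in scored[:n]]
```

The tool would then present the top-ranked words to the user, who accepts or rejects each one into the growing corpus.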

Google Sets used to provide this service, but it’s apparently now no longer available.

Adam suggested that we might look at tools based on word2vec datasets: for instance, wordgrabbag is able to find words that are proximate in the vector space to the words the user suggests.

Meanwhile, word2vec playground completes analogies based on user input. This is hugely fun to play with. Some sample output from that, which I enjoy because it makes a sort of sense without being entirely predictable:

[sample analogy output images omitted]
and, okay, the ethanol one is a bit odd. But part of the fun of mixed initiative systems is that they offer the creator options she likely wouldn’t have thought of in the first place. Besides, we could also imagine corpora that involved groupings of words, or words plus tags, as well as individual words.

An I-Feel-Lucky corpus-scraper

We speculated about a variant where you could specify the general type of list you wanted (e.g., “a list of books”) and then the corpus tool went off to wikipedia and came back with one of several possible lists of books there, such as the List of Books Written By Teenagers, or the List of Books Related to Hippie Subculture. (Sadly, the List of Books about Japanese Drums link just led to a generic article about Taiko and didn’t feature that much of a bibliography after all.) The idea again would be to surprise and delight the creator as well as the eventual reader.

To aid this discussion, Gabriella introduced us to the wikipedia List of Lists of Lists, which is one of the most pleasingly meta things I have seen in a long time.

Filtering and editing corpora

Another idea we batted around was of pulling a corpus of words with a lot of associated tags and then letting the user turn on and off subsets of the corpus. This would apply not just to removing offensive terminology, but perhaps to other purposes as well. (How the words would be automatically tagged in the first place is also a good question; perhaps via WordNet, ConceptNet, or information derived from word2vec or some other method.)
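
Once the tags exist, the turn-subsets-on-and-off idea reduces to simple set operations. A sketch, with invented words and tags:

```python
# Illustrative corpus of (word, tags) pairs; the tags are made up.
TAGGED = [
    ("carriage",   {"vehicle", "archaic"}),
    ("automobile", {"vehicle", "modern"}),
    ("zeppelin",   {"vehicle", "historical"}),
    ("ain't",      {"slang"}),
]

def filter_corpus(entries, require=frozenset(), exclude=frozenset()):
    """Keep entries carrying all `require` tags and none of `exclude`."""
    return [word for word, tags in entries
            if require <= tags and not (exclude & tags)]
```

The hard part, as noted, is producing the tags in the first place; the filtering itself is trivial.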

We talked about being able to generate corpora that were prefiltered for diction level (formal? slangy?) or historical period (e.g., “a list of vehicles appropriate to 1830”). We also raised the possibility of filtering words by sentiment rating, but sentiment analysis is not always particularly reliable or high-quality, so I am not sure what I think of the plausibility of this. On the other hand, having that setting in the tool might contribute to teaching users about the limits of sentiment analysis! So there’s that, perhaps.

Additional controls on grammars

Here we went a bit outside the lines of just talking about making a good corpus, and got into a conversation about other ways to put controls on the tool. The strategy here would be to allow the grammar to overgenerate — create more material than was needed, and create some material that wasn’t suitable — but be able to specify some constraints on that material after generation so that unsuitable things would be discarded.
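
A minimal sketch of that overgenerate-and-discard strategy, using an alliteration check as the post-hoc constraint (the vocabulary is invented):

```python
import random

ADJECTIVES = ["brooding", "silent", "silver", "bright"]
NOUNS = ["bridge", "stone", "signal", "bell"]

def generate(rng):
    """Unconstrained generation: any adjective with any noun."""
    return f"{rng.choice(ADJECTIVES)} {rng.choice(NOUNS)}"

def alliterative(phrase):
    """Post-hoc constraint: both words share a first letter."""
    first, second = phrase.split()
    return first[0] == second[0]

def overgenerate(n, rng, accept):
    """Generate freely, then discard anything the constraint rejects."""
    return [p for p in (generate(rng) for _ in range(n)) if accept(p)]
```

The grammar itself stays simple; all the selectivity lives in the filter, at the cost of wasted generation.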

Here we talked about ideas like nodes that could be marked up to produce, for instance, alliterative output. (Later in the weekend Adam showed me a project he’d put together where this was actually working on top of Tracery, to let one create Tracery projects that enforced alliteration. But I’ll let him link that project if he wants to share it with others.)

We also talked about controls that would apply to an entire sentence or paragraph of generated output, if we wanted to control for qualities or behaviors that would only manifest themselves at a macro scale.

So for instance, suppose you wanted to have a paragraph that was guaranteed to demonstrate varied sentence length. You could do this by making the grammar go sentence by sentence, remember the length of the last sentence, and try to get a different-length sentence this time. (This is what Savoir-Faire does with its sentence generation about thrown and dropped objects: it has a concept of short, medium, and long sentences, and tries not to make the same kind of sentence twice in a row.) But this can be laborious and a little clunky; and sometimes it might simply be impossible to generate something that corresponded to requirements.
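
The sentence-by-sentence memory approach can be sketched in a few lines (the word-count thresholds are invented for illustration; this is not Savoir-Faire’s actual implementation):

```python
import random

def length_class(sentence):
    """Bucket a sentence as short/medium/long by word count.
    Thresholds are arbitrary stand-ins."""
    n = len(sentence.split())
    return "short" if n < 5 else "medium" if n < 10 else "long"

def pick_varied(candidates, last_class, rng):
    """Prefer a candidate whose length class differs from the previous
    sentence; fall back to anything if none differs."""
    different = [c for c in candidates if length_class(c) != last_class]
    return rng.choice(different or candidates)
```

The fallback branch is the clunkiness mentioned above: if the grammar can only produce one length class at this point, the variety requirement quietly fails.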

Alternatively, you could have the grammar generate a lot of paragraphs without any particular memory or control over individual sentences, then select after the fact for paragraphs that qualified as sufficiently diverse; and you could do this with a machine-trained classifier that was able to apply fuzzier requirements to the output — making it more likely that you would get some match even with an incompletely populated grammar, and that there would be more variety from one output to the next.
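
A sketch of that select-after-the-fact version, with a crude sentence-length-spread score standing in for the machine-trained classifier:

```python
import statistics

def diversity(paragraph):
    """Fuzzy stand-in for a trained classifier: score a paragraph by
    the spread (population stdev) of its sentence lengths."""
    lengths = [len(s.split()) for s in paragraph.split(".") if s.strip()]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

def select_diverse(paragraphs, threshold=1.0):
    """Keep paragraphs whose sentence lengths vary enough."""
    return [p for p in paragraphs if diversity(p) >= threshold]
```

A real classifier would score fuzzier qualities than length spread, but the shape of the pipeline — generate many, score, keep the best — is the same.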

Another idea I really liked (though I haven’t written down who initially proposed it) was the idea of a probability curve that you could apply to generation over the course of a whole sentence or paragraph. This idea arose out of some of my ideas about the distribution of surprise in generated text (for more on which, see my PROCJAM talk, and the concept of Venom in Annals of the Parrigues). But the idea was that the user might be able to specify a curve — perhaps low at the beginning, then gradually rising over the course of the paragraph; or perhaps presenting several distinct peaks — that would determine how likely the system was to choose a grammar element with a particular stylistic feature. (Being “surprising,” statistically rare, offensive, high or low diction would all count here.)
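
A sketch of the curve idea, with a linear ramp as the curve (any function of position would do) and a single “surprising” pool standing in for stylistically tagged grammar elements:

```python
import random

def curved_choice(position, length, plain, surprising, rng):
    """Choose between a plain and a surprising expansion, with the
    odds of surprise rising linearly toward the end of the paragraph.
    The ramp is one possible curve; the user-specified curve could be
    any function of position."""
    p_surprise = position / max(length - 1, 1)  # 0.0 at start, 1.0 at end
    pool = surprising if rng.random() < p_surprise else plain
    return rng.choice(pool)
```

Generating a paragraph would then call this once per slot, so the distribution of surprise follows the drawn curve rather than being uniform.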

Finally, Adam raised the possibility of running a grammar a number of times while keeping track of which nodes were expanded, classifying the resulting text (e.g., is this output paragraph Hemingwayesque based on a machine-taught stylistic classifier?), and then using that information to build in percentages so that the generator would know how often to use expansion X rather than expansion Y when generating a Hemingway-style paragraph. (Essentially building the results of a Monte Carlo tree search back into the generator for future reference, as I understand it.)
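
As I understand the proposal, it amounts to counting which expansions appear in the runs the classifier approves of. A sketch, with a trivial predicate standing in for the trained style classifier:

```python
from collections import Counter

def learn_weights(grammar_runs, classifier):
    """Count how often each expansion appears in runs the classifier
    approves of, and turn the counts into selection weights.

    `grammar_runs` is a list of (text, expansions_used) pairs from
    repeatedly running a grammar; `classifier` is any predicate on
    the text (a stand-in for e.g. a Hemingwayesque-style classifier).
    """
    counts = Counter()
    for text, expansions in grammar_runs:
        if classifier(text):
            counts.update(expansions)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}
```

The resulting weights would be baked back into the grammar, so future generation in that style favors the expansions that historically produced it.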

Transformational grammars on top of generative grammars

We talked a bit about the concept of the transformational grammar and whether it would be useful to introduce some transformational grammar tools on top of the generative ones. (Later in the weekend Joris Dormans’ dungeon-generation talk came back to the concept of transformational grammars, but in a rather different context.)

Tools and visualization

Someone floated the idea of a “rich text Tracery”: one in which you could affect the likelihood of a particular corpus element being selected, or associate tags with it, by changing the font size and color of the entry. (I proposed that in a corpus of mixed modern and archaic words, the archaic words could be rendered in a flowy handwriting script. This is probably silly.)

I also shared some of the ideas from my previous posts here about visualization of procedurally generated text and about notifying the user when added corpus features actually reduce rather than improve the player’s perception of output variety.

That talk about experienced variety also led someone to mention a method by Tom Francis, in which a game starts with a generative system but then over time additional generative features are unlocked. (This is done purely on the basis of time spent playing, not on whether the player has succeeded at something or unlocked a new checkpoint.) The idea is to let the player get to where they feel they fully understand the range of output possible in the grammar, and then surprise them by demonstrating that there’s still more. This in turn reminded me of Rob Daviau’s work with Legacy boardgames that introduce new mechanics over the course of repeated playing.

Other resources

This PCG book has chapters online about a lot of current research and possible tools.

Tony Veale has done a lot of work with computer-generated metaphor and analogy.

The slides from Adam’s talk contain some great information about different machine learning algorithms and their uses, including when it comes to text.

This workshop revealed incidentally that wikipedia features a List of Knitters in Literature. I just needed to share that.

Lost thoughts

My notes also contain one or two lines about training an LSTM on a large corpus and then testing the output of a generator to see whether the generated sentence was probable or not probable. I can’t recall what challenge this solution was supposed to resolve, but possibly someone else from the workshop will find it jogs their memory.

On a later page I also have the phrase “tagged corpora trained bidirectional LSTM sequence to number” but I’m also not certain what that was about. It’s in green ink, though.

Finally, the notes stress the importance of QA-testing a large generated space. They do not suggest a solution to this problem.

7 thoughts on “Casual Procgen Text Tools”

  1. Sounds like a cool workshop! Speaking of unique corpora to comb through, I wonder if you have come across the Enron Corpus, a massive collection of >1.7 million emails (originally seized from Enron by federal investigators) which is now in the public domain. It’s since been used for a variety of research aims in linguistics and computer science, including the following study of how phrases signal workplace hierarchy which I remember reading about and being fascinated by.

    I wonder if a massive corpus like this (with numerous repeated email interactions between many actual individuals) could be mined for ways to procedurally generate plausible written interactions between people.

  2. The idea about probability curves and surprise–is this different from a Markov model? Jason Hutchens did some interesting work using Markov models that emphasize surprise for writing chatbots, back when he was still entering the Loebner contest. And, of course, Markov chains power a lot of popular twitter bots drawing from preexisting corpora (like a politician’s speeches, or an author’s complete works).

    • It is different, yeah — the idea is that these curves would be influencing the expansions chosen by a grammar, so that you can guarantee a grammatical and sensible output. Markov models work with probabilities of certain words occurring next, but that doesn’t mean that the generated sentence remains coherent for more than a few words at a time. (For anyone else reading who’s curious, here’s a writeup of Hutchens’ MegaHAL project.)
