I recently read an interesting article by some Microsoft playtesters that suggests running playtesting studies with 25-35 participants, each focusing on a single hour of gameplay and followed up with a standardized survey. The idea is that this could be done repeatedly over the course of a game’s development in order to drive gameplay improvements and then confirm that the changes have had the desired effect. This method contrasts with usability tests (one- to two-hour one-on-one interviews with testers, usually conducted with a group of eight or so) in that it is more statistically reliable, though not as in-depth.
This is something I’ve been thinking about a lot lately.
Aaron Reed’s research on novice players of “Whom the Telling Changed” was extremely useful, especially his analysis of which errors players were most likely to make. Substantially reducing these errors is an obvious goal, and the whole IF community benefits when such problems are identified and the tools improved to address them. Presumably additional testing would be of further use, especially if it’s focused on making tools and libraries that better handle the areas that traditionally trip up new players: learning the interface and communicating with the parser.
In the short term, I’m concerned with refining the conversation system. I’ve been fortunate that there has already been a lot of feedback about Alabaster, and it has suggested some directions for improvement, most especially hinting to the player about how to use abbreviations and reminding him that it’s not necessary to retype the whole command verbatim. I’m eager to implement these improvements and then see what people make of the results.
In the longer term, a handful of adjustable-default games already exist to let people explore their preferences about default IF behavior: narration voice and tense, compass directions, and so on. It might be interesting to get feedback on some of these from a larger group composed of IF novices.
What I imagine, based on the description here, is this: get access to a computer lab (possibly at a college or university). Install the game on all of the lab computers and set it to keep transcripts automatically. Have a roomful of participants play for roughly an hour, then fill in a survey about their experience (ideally using some sort of ID tag to associate each survey with its specific transcript).
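As a rough sketch of the bookkeeping that the ID-tag idea implies (the file layout, column names, and error string below are my own assumptions, not anything from the article), a short script could pair each survey row with its transcript and pull a few quick numbers out of each session:

```python
# Hypothetical post-session script: pairs each survey row with the matching
# transcript via a shared participant ID. Everything here (paths, column
# names, the error string) is assumed, not taken from any particular tool.
import csv
from pathlib import Path

TRANSCRIPT_DIR = Path("transcripts")     # transcripts saved as <participant_id>.txt
SURVEY_CSV = Path("survey_results.csv")  # survey export with a participant_id column

def transcript_stats(path: Path) -> dict:
    """Very rough per-session numbers: commands typed and parser rejections."""
    commands = errors = 0
    for line in path.read_text(errors="replace").splitlines():
        if line.startswith(">"):              # assumed: player input lines start with the prompt
            commands += 1
        if "not a verb I recognise" in line:  # Inform's default unknown-verb reply
            errors += 1
    return {"commands": commands, "parser_errors": errors}

def merge_sessions() -> None:
    with SURVEY_CSV.open(newline="") as f:
        for row in csv.DictReader(f):
            transcript = TRANSCRIPT_DIR / f"{row['participant_id']}.txt"
            if not transcript.exists():
                print(f"no transcript found for {row['participant_id']}")
                continue
            print(row["participant_id"], row.get("fun_rating"), transcript_stats(transcript))

if __name__ == "__main__":
    merge_sessions()
```

Nothing fancy, but it would turn a folder of transcripts and a survey export into one per-participant table that could be eyeballed or fed into whatever analysis follows.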
So the question: what other resources describe best practices for playtesting projects? What sorts of pitfalls should one watch out for? How much would each playtester need to be paid for an hour’s work to make participating worthwhile? (I imagine if the players were themselves college students the answer might be $10-$20, but for adults…?) I realize I’m not trained in the field of usability testing and that there’s a huge amount to know, but given that hiring a consultant is probably not in my budget, is there an amateur approximation that would be of some use?
You could find a usability tester, explain that you want to apply usability testing to IF, and ask what that would involve. Most professions have a process for educating people about what they do, and (don’t worry) a clearly defined line as to when a conversation becomes a payable consultation.
Same idea, times twenty: you could conduct an informational interview about applying usability testing to IF, and put it on your blog.
I’d imagine the basic idea would be that you specify the ideal behavior of the user at design time, collect samples of actual behavior, and examine the deviations. Similarly for the user’s reported level of fun.
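As a toy illustration of that deviation idea (the commands and the similarity measure here are placeholders of mine, not drawn from any real game or from this comment), you could score each transcript by how far the tester’s commands stray from an ideal walkthrough:

```python
# Toy illustration only: score how far a tester's commands stray from an
# "ideal" walkthrough, using a standard sequence-similarity ratio.
from difflib import SequenceMatcher

# Placeholder walkthrough for one scene (not from any real game).
IDEAL = ["north", "ask guard about gate", "unlock gate with key", "open gate"]

def deviation_score(actual_commands: list[str]) -> float:
    """0.0 = identical to the ideal path, 1.0 = nothing in common."""
    return 1.0 - SequenceMatcher(None, IDEAL, actual_commands).ratio()

# A player who flounders a little before finding the right commands:
actual = ["look", "north", "talk to guard", "ask guard about gate",
          "unlock gate with key", "open gate"]
print(deviation_score(actual))  # roughly 0.2: fairly close to the ideal path
```

A crude score like this wouldn’t explain *why* players deviate, but it would make it easy to compare builds or to rank scenes by how much trouble they cause.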
I think it’s a great idea, one which could really benefit our understanding of how people play text games. I’d think a small number of testers would do, especially in the early stages, while you’re still figuring out what you’re looking for.
Conrad.
College students work for pizza. :)
Usually it’s $20/hour for usability participants when you recruit them off the street. Pizza works pretty well with students, too, given < 1 hour play time.
I would also look seriously into doing some screen recording of participants. I imagine transcripts would give you a good overview of the play experience, but getting a sense of where a player hesitates would provide deeper insight. Even better if you can have a moderator sitting next to them to ask them to think aloud when they get stuck, and record audio or take notes on this. Intent is hard to get users to express without prompting — especially when they're unsure about what they're doing.
The article I quoted here seemed to be arguing that while that kind of intensive usability testing is very helpful (recording behavior, one-on-one moderation, etc), it’s also very expensive to perform, and *for some purposes* a larger group — even if less closely scrutinized — can give better feedback.
This could be wrong, but that’s the place I was starting from here.
I think that as far as quantifying whether one design is better than another, the playtest model seems workable. To figure out what you should do to make a better design, though, I agree with Jill below: I wouldn’t trust an after-the-fact survey to give you the whole picture.
I think usability testing gets a bad rap, to a degree, about its expense. It’s true that if you want to run a full-fledged study, it does take a lot of time and money. But you can run small-scale tests on the cheap yourself. I’d recommend googling ‘guerilla usability’ for some ideas — there are also some books I could recommend if you’re interested.
Thanks for the tip about “guerilla usability” — in particular the existence of the Silverback application makes this sound like something I could actually do with my laptop, rather than a complicated AV project where I’d have to rent multiple video cameras and whatnot.
I once read in a game developer’s blog that, when it comes to recording data during testing, what’s really important is *not simply asking* the testers what they thought and felt after they play, whether verbally or in a written survey. Even if they don’t mean to, they will usually distort and abstract the data you want into a simplified, much less rich (maybe even false!) impression.
What’s much better is monitoring the testers’ reactions as they play: when they become happy, engrossed, or frustrated. Valve famously records and analyzes live data from its testers, then revises or even mercilessly cuts every point at which a significant portion of testers gets stuck.
So I’d consider, along with interviewing or surveying the testers afterwards, *video-recording* their faces and analyzing the footage in concert with their actions on the screen. Unfortunately, this would probably take much more time, so in return I’d suggest reducing the number of testers. In my opinion, a smaller pool of testers with richer data from each one is preferable to a large pool with shallow data.
Also, rather than a large pool, I suggest trying to recruit a small but *diverse* group of testers, to help offset the bias that comes with a small sample.
Upside to doing it at a university: if you have somebody there who does anthropological studies, you can likely get pointers, if not grad students looking for projects / experience. The chunk of this work done alongside my dissertation happened in an Information Studies department.
Downside: Institutional Review Board. You should be exempt, but you need to get the IRB to agree that you’re exempt, which will at least require filling out 4-5 pages of paperwork and fully documenting your study protocol and intended analysis before you begin. (http://www.hhs.gov/ohrp/irb/irb_chapter1.htm section A) Although the IRB at your institution *may* not *technically* have jurisdiction over your work, they’ll probably assert it.
(When I sat on an IRB, we even insisted that students taking oral histories had to get IRB approval, just to make sure we were clear of federal regulations. There are some guidances from the NIH that imply an IRB may not need to be so all-encompassing, but they’re unclear enough that most IRBs seem to monitor all research with human subjects.)
In survey design, pollsters usually start out with a small sample whom they ask open-ended questions. The idea is to get a broad (and statistically fuzzy) view.
That informs survey design, in which hard data is collected from a much larger group.
C.
ps – On Wikipedia, somebody famous claims that a 5-person test group will spot 95% of your problems.
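For what it’s worth, that claim usually traces back to Jakob Nielsen’s problem-discovery model, found(n) = 1 - (1 - L)^n, where L is the share of problems a single tester uncovers (commonly estimated at about 0.31). A quick sketch of what that predicts, using Nielsen’s numbers rather than anything from this thread:

```python
# Sketch of Nielsen's problem-discovery curve: the share of usability problems
# found by n testers if each tester independently uncovers a fraction L of them.
L = 0.31  # Nielsen's commonly cited estimate for a single tester

def problems_found(n: int, single_tester_rate: float = L) -> float:
    return 1 - (1 - single_tester_rate) ** n

for n in (1, 3, 5, 10, 15):
    print(n, round(problems_found(n), 2))
# With L = 0.31, five testers catch roughly 85% of problems and fifteen nearly all.
```

By that model five testers land closer to 85% than 95%, though the real point is the shape of the curve: steep early, flat later.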
As noted above, you can frequently get college students in exchange for pizza, and $5 per student will buy you a lot of pizza. You’ll get even more takers (but risk self-selecting in unhelpful ways) if you say, “Free pizza in exchange for playing a video game for an hour.”
If you can convince the local psych department to go in on the research (surely something about human-computer interaction in English is interesting to psych professors!), you may be able to do it even more cheaply. Intro psych students, at least here, are required to participate in at least one study. Given that most of those studies take the form of “do mindless repetitive work,” I think the chance to “play a video game for an hour” will get you a lot of takers.
The ‘human-technology interaction’ (HTI) group at Eindhoven University of Technology (Netherlands) recently launched ‘VLAB’, a ‘virtual laboratory’.
Once someone is registered on the VLAB website, he or she can be invited to various experiments, all performed at home in the web browser. Per experiment, about €3 to €5 is sent to the participant’s PayPal account.
Limiting the experimental setting to the person’s own home or office PC is certainly a serious constraint for research, but if the system succeeds, it could make it easier to run experiments with larger groups of test subjects. I know that the HTI group often has trouble finding people for its experiments.
The website is, unfortunately, only in Dutch: http://w3.vlab.nl/