The following post is intended to be the first of many methodological/experiential ruminations about researching specific VR-related phenomena. Most of the terminology employed will be adapted from key microsociological works in related areas. For a primer, see Suchman (1987) and Goodwin (2017), along with Sacks (1995).

All of the posts are meant to be highly preliminary and speculative. I feel that this is useful for people who want to join (and maybe redirect) the conversation about microsociology in virtual reality.

Gestural Work in ‘Arizona Sunshine’

One of my recent forays into virtual reality was a cooperative multiplayer session in ‘Arizona Sunshine’, a VR game where you navigate a post-apocalyptic, zombie-infested landscape in search of supplies and survivors. Much of the game involves searching cupboards, car trunks and (hopefully!) abandoned buildings for guns, ammunition and health kits. It also involves shooting zombies in the head, naturally.

As this is a virtual reality game, you play it by being strapped into a virtual reality headset (the Vive Pro, in our case). The Vive accurately tracks the headset in 3D space (if I move my head in real life, the same movement occurs in virtual reality, with a corresponding change of perspective). Additionally, it tracks two controllers (one for each hand) which serve as the main input mechanisms for most virtual reality games. I will cover the specifics of our lab’s VR system in a separate post.

In this session, I was joined by an unnamed second player from the United States. They seemed to be much more knowledgeable about the gameplay mechanics than I was, so I effectively had a kind of tour guide who would point out the locations of various items and instruct me to perform certain actions. On more than one occasion, the other player also saved me from being eaten by zombies – but I digress.

In all our interactions, we both had to employ the at-hand resources provided to us by a unique confluence of technological affordances and conventions. Before demonstrating this via the central example of this post, let us discuss the specific technological limitations/opportunities:

Figure 1. My first-person perspective. The large objects on both sides are the virtual guns in my hands.

As can be seen in the above screenshot, both characters (usually) have guns equipped in both hands. Things can still be grabbed with the guns in hand, but the guns themselves tended to be held at all times. Guns can be pointed in any direction and be used to achieve quite a bit of gestural work. Another mutually visible in-game element is the head-tilt. Even though the eyes are not tracked in this game, the position of the head is tracked by the VR headset. Another specificity: though the game allows players to communicate through speech, the other player’s microphone was muted in this instance. This circumstance created the necessity to ‘get by’ solely through the use of non-verbal resources.

For the above reasons, most of the interaction between me and the other player occurred through a combination of head and hand movement, with the specific restriction that both hands were occupied by guns in most cases. Given these conditions, one could have assumed that very little happened between us, interaction-wise. In actuality, these resources were more than enough to provide a scaffolding for procedurally-generated ad-hoc conventions:

The head

Coordinated nodding was used to accomplish mutually signaled agreement or understanding. This agreement could be supported by hand gestures if necessary. The position of the head was used to maintain a mutual focus of attention on a particular area of the game world, or coordinate a shift from one area to another. The reciprocation of gaze could form a temporary participation framework, or disrupt it by the visible disattendance (‘loss of focus’) by any participant.

The ‘gun hands’

Unlike in the physical world, fingers are not tracked in Arizona Sunshine – the controllers are effectively monodirectional sticks, and are tracked as such. This limitation has the consequence of reducing the gestural repertoire to linear pointing: the guns can be oriented in any direction in 3D space, and be moved through it without restrictions. Thus, it was impossible to do a ‘victory sign’, a ‘thumbs up’ or a ‘middle finger gesture’ through the usual combination of individual finger arrangements. This does not, however, mean that gestures could not be developed. After all, the temporal-historical movement of sticks through space, coordinated and witnessable in-situ, allows for a rather complex sequence of constellations (a conductor being an obvious example; note to self: look up research on cheerleading conventions).

Arcs could be drawn between a current head-tilt focus and an intended target. The ‘sticks’ could be waved, they could track movement, and they could be used to express agreement or disagreement. One curious convention for ‘thumbs up’ was developed in the current example: at certain points I had the need to express something like a cheer or a ‘yeah!’, so I used a rapid ‘air pumping’ movement of both hands (alternating or synchronized).

Similarly, disagreement or ‘no’ (the actual use of the gesture was, naturally, not predetermined by a convention) could be achieved by crossing the arms or waving the guns side-to-side in a synchronized fashion. At one point (sadly, the recording was corrupted), the other player performed an ‘opinion asking’ by taking two masks (one in each hand) and doing a ‘weighing’ motion, which made it possible for me to point out the mask I preferred.

Everything I’ve said so far could have been interpreted as scenarios where interaction was somehow limited. I’m not sure whether that’s a useful way to look at it: we do not consider physical interaction as ‘interaction under conditions where we don’t have four hands’. We coordinate action with whatever is within actionable grasp. If one resource is not available, perhaps there are others that can do similar work. Or perhaps the nature of the interactional work itself may be adapted to the resources. An interesting case of the latter is the use of the guns for action-at-a-distance: as was mentioned previously, the other player did not have an active microphone.

Coupled with the fact that we often were several dozen in-game meters apart, we were faced with the difficulty of gaining the attention of the other player. This issue was solved by shooting the other player in the head. The shot caused the player’s screen to turn red (in addition to the audible cue) and accomplished a shift of attention without the need to get closer to the player. This telecommunicative convention was particularly useful for the ‘quest’ part of the game. During certain stages, quest-relevant items had to be retrieved from a particular spot (as can be seen in the following video, a key was located in a car). As such, it was useful to have a means of ‘summoning’ the other player rather than leading them, step-by-step, to the specific spot. Headshots accomplished that. In short, here we have a convention that made creative use of in-game mechanics originally designed for other purposes.

Disambiguating instructions

Perhaps the most interesting case of complex co-operative action occurred in the following video. The first part shows ‘gun-based long-distance communication’, but the ‘meat’ of the video comes later.

Some background: Arizona Sunshine allows players to don masks which are scattered throughout the game world. In order to put on the mask, a player has to grab it and drag it onto their own face (mimicking the way a real-life mask would be put on). If the mask is close enough to the face when it is released, it automatically equips. Earlier in the session, I actually found (and equipped) a mask without issue. When we entered a house, the other player changed their mask, discarding the old mask onto the floor. Having forgotten that I was already wearing a mask, I decided to put on the mask the other player dropped. This is, however, not possible through the same set of movements. In order to put on the mask, I had to first unequip the previous mask by grabbing my face, dragging away the mask, dropping it, then putting on the new mask.
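For readers who think better in code, the equip rule described above can be sketched as a tiny state model. This is only my reading of the observed behaviour, not the game's actual implementation: the class, the method names and the distance threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

EQUIP_RADIUS_M = 0.25  # hypothetical threshold; the game's real value is unknown


@dataclass
class Player:
    worn_mask: Optional[str] = None

    def release_mask_near_face(self, mask: str, distance_m: float) -> bool:
        """Try to equip a held mask when it is released; return True on success."""
        if distance_m > EQUIP_RADIUS_M:
            return False  # released too far from the face: nothing is equipped
        if self.worn_mask is not None:
            return False  # a mask is already worn, so the same motion does nothing
        self.worn_mask = mask
        return True

    def pull_mask_off_face(self) -> Optional[str]:
        """Grab the face and drag the worn mask away; return the removed mask."""
        removed, self.worn_mask = self.worn_mask, None
        return removed


me = Player(worn_mask="first mask")
first_try = me.release_mask_near_face("dropped mask", 0.1)   # same gesture, no effect
removed = me.pull_mask_off_face()                            # the step I was missing
second_try = me.release_mask_near_face("dropped mask", 0.1)  # now it equips
print(first_try, removed, second_try, me.worn_mask)
```

The point of the sketch is that the identical 'drag to face' movement has two different outcomes depending on a state (the worn mask) that I could not see myself.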

Watch the video to see how the other player assists me in figuring out how to put on the new mask. One of the primary difficulties is the fact that the sequence cannot be simply shown to me, as it looks similar to the process of putting on a mask for the first time. In order to disambiguate between the messages ‘this is how you put on a mask’ and ‘remove the old mask, then put on the new mask’, a lot of work needed to be done, particularly considering that the fact of my mask-wearing was visible to the other player but not to myself.

This sequence alone could be the focus of an entire research endeavour, with particular attention being paid to the exact sequentiality of gestures, nods, uses of props and manipulations of a continuously morphing public substrate. The latter term warrants some explanation. I highly recommend reading the first chapter of Goodwin (2017) in its entirety, but here’s the gist of it:

This process of building something new through decomposition and reuse with transformation of resources placed in a public environment by an earlier actor is what I am investigating as co-operative action. As all of the materials to be examined later in this book demonstrate, it is pervasive in the organization of human action. The ability of this process to endow human history with its unique accumulative power, that is, as something progressively shaped by a consequential past, while remaining both contingent and open-ended, is captured by Merleau-Ponty’s (1962:88) observation that “history is neither a perpetual novelty, nor a perpetual repetition, but the unique movement which creates stable forms and breaks them up.” For simplicity the term substrate is used to point to the earlier utterance, or another kind of sign complex (a hopscotch grid, for example), that is the focus of transformative operations being used by another actor to create a next action.

(Goodwin, 2017, p. 3, emphasis mine)

I actually tried to do a gloss-like description of the whole procedure just now, but it turned out to be far too complex, coordinated and sequentially unfolding to be a useful exercise. What’s necessary is a detailed multimodal transcript to figure out what was visibly being done, possibly through ELAN or something more Goodwinian.
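To give a rough idea of what such a transcript might look like as data, here is a minimal sketch that loosely follows the way ELAN organizes annotations into time-aligned tiers. It is not ELAN's file format: the tier names, labels and timings below are invented for illustration and are not taken from the actual recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    tier: str      # e.g. "P2-head"; tier names are made up for this sketch
    start: float   # seconds from the start of the recording
    end: float
    label: str


def co_occurring(annotations, tier_a, tier_b):
    """Return pairs of annotations from two tiers whose time spans overlap."""
    pairs = []
    for a in annotations:
        if a.tier != tier_a:
            continue
        for b in annotations:
            if b.tier == tier_b and a.start < b.end and b.start < a.end:
                pairs.append((a, b))
    return pairs


# Invented timings, purely illustrative.
transcript = [
    Annotation("P2-head", 0.0, 1.5, "looks at my face"),
    Annotation("P2-hands", 1.0, 2.5, "drags hand away from own face"),
    Annotation("P1-head", 2.0, 3.0, "nods"),
    Annotation("P1-hands", 4.0, 5.0, "grabs own face"),
]

overlaps = co_occurring(transcript, "P2-hands", "P1-head")
for a, b in overlaps:
    print(f"{a.label!r} overlaps {b.label!r}")
```

Keeping one tier per participant and body part would make it possible to ask where, say, a nod lands relative to the other player's demonstration, which is exactly the kind of sequential detail a gloss loses.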

I am currently interested in gathering much more data (preferably with different players) on how at-hand resources are created, employed and modified. I am not certain that the specific set of conventions (such as the ‘yeah!’) will emerge as typical resources for most players, particularly when one considers constellations where all players can use the resource of verbal communication. I will gather more recordings and will keep you updated with additional ruminations. At the moment it seems to me that Goodwinian multimodal analysis might be a good framework for actual analytic work, especially since he has had a historic focus on (for lack of a better term) ‘limited’ interaction. It would be interesting to see whether particular kinds of interactional restrictions lead to an increased ‘mutual action-inhabiting’ and substrate-building.


Goodwin, C. (2018). Co-operative action. Cambridge University Press.

Sacks, H. (1995). Lectures on conversation (G. Jefferson, Ed.). Blackwell.

Suchman, L. A. (1987). Plans and situated actions: The problem of human-machine communication. Cambridge University Press.