Arizona Sunshine and the Question of Monomodality

Over the course of the project, a number of dedicated topics worthy of separate research have emerged. Each of these topics, given enough data analysis, literature review and, most importantly, conceptual work, would be a candidate for publication. I would like to use this space to sketch some of these ideas in outline, as they are developing, and I'll try to add relevant literature for each topic. Due to the relative length of the first topic, I have decided to split these reports into separate posts.

This is the oldest and arguably most developed line of inquiry (for the introductory blog post about this, click here). Virtual reality allows for many different constellations of differently-abled interactants to meet in the same space. With Arizona Sunshine, you oftentimes meet ‘mute’ players without microphones who only have gestural input and body tracking available to them as communicative tools.

In the original blog post, I talked about how difficult it was to communicate “Hey, you are wearing a mask. In order to put on a new mask, you need to remove the mask you are already wearing.” without any words. There seemed to exist limitations peculiar to gestural modalities, since the sequence ‘put on mask’ (grab mask, hand towards face, grab button release, hand away from face) was difficult to disambiguate from ‘remove mask’ (hand towards face, grab button press, hand away from face, grab button release):

  1. ‘grab button press’ and ‘grab button release’ are only different in their consequence, but do not register as visible events for other players.
  2. The initiation of ‘remove mask’ from the ‘home position’ of the player’s hand adds further ambiguity, as there is no apparent method of separating ‘sequence preparation’ from ‘sequence start’. Thus, the hand moving towards the face may either be interpreted as the first step of ‘how to put on a mask’, or the preliminary step towards the initiation of ‘how to remove a mask’.
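This ambiguity can be sketched as a toy model: filtering out the button events – which are invisible to other players – leaves two observable traces, one of which is contained in the other. The event names below are my own shorthand, not Arizona Sunshine's actual input events.

```python
# Toy model of the two mask gestures as witnessable event sequences.
# Event names are illustrative shorthand, not the game's actual input events.

PUT_ON = ["grab_mask", "hand_to_face", "grab_release", "hand_from_face"]
REMOVE = ["hand_to_face", "grab_press", "hand_from_face", "grab_release"]

# Button presses and releases only differ in their consequence;
# they do not register as visible events for other players.
INVISIBLE = {"grab_press", "grab_release"}

def observable(sequence):
    """The trace of a gesture as witnessed by another player."""
    return [event for event in sequence if event not in INVISIBLE]

# The visible trace of 'remove mask' is contained in that of 'put on mask':
# both reduce to a hand moving towards, then away from, the face.
print(observable(PUT_ON))  # ['grab_mask', 'hand_to_face', 'hand_from_face']
print(observable(REMOVE))  # ['hand_to_face', 'hand_from_face']
```

Point 2 compounds this: since both observable traces involve the hand moving towards the face from the same home position, even the first visible event cannot be unambiguously assigned to either sequence.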

More fundamentally, however, there seems to be a difficulty in using gesture sequences to point at ongoing gesture sequences. That is, it seems nontrivial to do ‘the thing you are doing right now is incorrect, do this instead’ with gestures alone. This property originally led me down the avenue of exploring the specific indexical qualities of speech as opposed to non-speech interaction. With words, it is relatively easy to point at ongoing action. Gestural sequences, on the other hand, seem to be employed in a sequential – rather than simultaneous and mutually-pointing – manner, which has a notable transformative effect on how interaction unfolds.

However, over time, I’ve come to reconsider this issue in more general terms. The problem of ‘pointing at ongoing action’ (“stop doing this, do this instead”) is not a peculiar characteristic of speech, but the specific multimodal ecology of an encounter. Goodwin’s analysis of Chil’s rich interactional toolset (in the near-total absence of complex self-produced language) is an illustration of non-verbal indication: we can use prosody, gestures, gaze, body-position, etc. to actively disengage, engage or point at ongoing sequences of action. Chil could, by literally pointing at speech produced by other people, say ‘this’ and ‘not this’. Using rich prosody, he could take an active part in the co-operative production of ‘I don’t want toast, but I want something similar to it’.

In other words, the property of pointing at ongoing sequences of action, and the complex mutual accomplishment of action enabled thereby, is not a characteristic peculiar to spoken human language. Put differently, the difficulty of pointing-at-sequentially-unfolding-action seems to be rooted in something that is not immediately related to the specific modality. This brings me to my current preliminary hypothesis: the issue is not so much the restriction imparted by a specific semiotic register. Rather, the issue is the presence of multiple registers, or their absence. In a world of only sequential gestures, it becomes much more difficult to say ‘stop doing that’, as the doing-‘stop-doing-that’-ness has to come after the completion of the other interactant’s sequence, which makes it more difficult to treat the ongoing sequence as a public substrate available for modification and reuse.

Even more generally, this leads us to the question of monomodal/bimodal types of interactions, i.e. spaces where our pointings – and thereby the capacity for substration – are limited: is the definition of modality, semiotic register, public substrate, interactional ecology, etc. rigid enough to afford an analysis of spaces with reduced modalities? More than that: is it even possible to convincingly postulate that monomodal spaces can exist, for all practical purposes? After all, Arizona Sunshine, even in the absence of voice input, doesn’t force users to use a single semiotic register for interaction. Bodies can move, we can point at objects in the world, we can develop complex ad-hoc symbolic conventions, we can probably do a range of unexpected things. Similarly, if the issue is ‘sparse modalities as limiting the synchronous manipulation of the public substrate’, then doesn’t this also mean that telephone conversations have similar issues? Would this consideration not invite a more granular approach that analyzes things like telephone-conversation overlap as a case of limited-modality-synchronicity?

These are the questions I am currently considering regarding this topic. It will take a serious review of the existing literature, a detailed analysis of our data and more collective discussions to arrive at some semi-convincing conclusion. In the meantime, it seems prudent to read more contemporary literature on semiotics, particularly concerning indexicality, and to consult classical conversation-analytic studies of spaces with seemingly reduced modal repertoires. Goodwin’s analysis of Peircean semiosis, along with modern semiotic approaches (including Goodwin’s intertwined semiosis), is instructive.


Goodwin, C. (2000). Action and embodiment within situated human interaction. Journal of Pragmatics, 32(10), 1489–1522. doi:10.1016/s0378-2166(99)00096-x 

Mondada, L. (2016). Challenges of multimodality: Language and the body in social interaction. Journal of Sociolinguistics, 20(3), 336–366. doi:10.1111/josl.1_12177

Mondada, L. (2018). Multiple Temporalities of Language and Body in Interaction: Challenges for Transcribing Multimodality. Research on Language and Social Interaction, 51(1), 85–106. doi:10.1080/08351813.2018.1413878 

Our available VR titles

I’ve decided to finally do a write-up of the current social virtual reality applications we have available for research.

We’re constantly looking for new social spaces, so feel free to message us with suggestions. The core requirements are:

  • There must be a multiplayer (preferably cooperative) element to the VR interaction
  • There must be a reasonably active playerbase
  • The application must be compatible with the HTC Vive

One of the core problems with current virtual reality applications is their relatively low total playerbase. At the time of writing, there are only about 1500 players using VR multiplayer applications worldwide. Only the top three most popular VR-only applications (Pavlov VR, Rec Room, Arizona Sunshine) currently have more than 100 concurrent players (Source). This is likely one of the reasons why some applications allow non-VR users to play as well.

This handicap of cutting-edge technologies is slightly annoying for practical purposes. Some for-VR projects with interesting interactional mechanics never took off as multiplayer titles, becoming purely single-player experiences instead. This will likely change when VR becomes more accessible to the average consumer, but it means that I’ll only cover those applications where there is a realistic expectation of multi-user interaction. One consequence of this limitation is that the list skews towards game-type virtual spaces, since these tend to attract a more consistent playerbase. The downside of game-type VR applications is that some game loops do not require any meaningful cooperation between players, making them a less appealing subject for VR-interaction research. The first item on this list is one notable exception.

Arizona Sunshine

This checks all the boxes, really. Active playerbase? Check. Cooperation? Check. Arizona Sunshine (AS) is a cooperative post-apocalyptic zombie survival game. Players move through a semi-linear landscape, coordinating anti-zombie activity, searching for loot and scarce ammunition, and navigating around various obstacles. Players have their hands, their head and sometimes their voice as resources for cooperation and coordination. See the previous posts for a glimpse of the rich, varied and unexpected communicative strategies employed by AS players.

Click here to see my blog post about interaction in Arizona Sunshine.

VRChat


Arguably the weirdest, most advanced, most well-known, most memetic space in contemporary social VR. I still don’t quite know how to make sense of this space in its entirety. On the surface, it’s a social application where VR users from all over the world can meet and mingle. In practice, it’s a cacophonic torrent of nonsense-effervescence, where multitudes of differently-avatared users compete for one another’s attention, deliberate about anything and everything, and compete in the ever-evolving race to more immersive and realistic full-body tracking. It’s an eminently fascinating cultural phenomenon, and slightly scary to encounter for the first time.

Below is a recording of one of our first experiences in VRChat.

Many of the interactions seem to hinge upon a kind of technological one-upmanship, where players construct elaborate animated avatars to direct the attention of other players in the room. We’ve been debating this topic internally, and there are currently two interesting streams that can be explored. Firstly, there’s the late-Durkheimian/Goffmanian question of the dynamics of attention, and the link between focus management and a sense of ‘being there together’. Secondly, there is the more ‘macro-level’ question of how these avatars circulate through VRChat. Since many of these avatars can be replicated by clicking a button, they can spread like attention-grabbing viruses through different public spaces, until their memetic potential wanes and the space becomes ripe for some new avatar-based innovation.

Rec Room

VRChat’s more well-behaved younger brother. Rec Room is a social environment that features both spaces for pure communication and different types of party games (charades, sports, DnD-type adventures). It’s very polished and is therefore accessible to a broad range of users. I’m especially curious about its charades area: I have an inkling that VR charades have different properties, given the limited range of inputs. As with AS, the hands and head are tracked by default. There are options for additional emotes; proximity-based voice chat is used almost universally.

Altspace VR

A more polished social VR application that seems to be situated between the off-the-rails VRChat and the slightly infantile vibe of Rec Room. It is very similar to Rec Room, but may have some variations in its userbase. In this space, I’d be particularly interested in its VR-boardgames aspect, since the more mature communicative features of Altspace could facilitate novel mixed-reality interactions between players.

Minecraft VR (Vivecraft)

A VR version of Minecraft, with adapted in-game mechanics. There are a number of semi-active multiplayer servers where users build, farm, mine and fight together. Since this is Minecraft, the possibilities for the types of interactions are limited only by the imaginations of the players. I am personally interested in how large ‘builds’ (i.e. large-scale monument/building constructions) are coordinated between players, and whether they end up significantly employing the VR-specific resources that are available (such as complex deixis).

Onward


Onward is a squad-based tactical shooter. It is differentiated from titles like Pavlov or War Dust by being focused on team coordination and the slow, tactical progression through the level (depending on the specific objective). While it features simulated radio band communication, the player has to actually physically trigger the radio transmitter with their hand. This encourages a reliance on nonverbal communication and makes this specific game more interesting for our purposes.

Another curious specificity of Onward is its focus on realism: every gun has to be handled based on its real-world reloading sequence. This results in a steep learning curve for even the most basic actions (such as reloading).

Keep Talking and Nobody Explodes

This game is built around a communicative asymmetry between two human players. One player can see, describe and manipulate a virtual bomb, while the other player, using a physical printed manual, has to guide the defusal process ‘from the real world’. This environment is especially curious as it features a kind of ‘control room logic’ reminiscent of the work of Suchman and Reeves et al.

Maria and I decided to upload a video of how this works. The top right corner shows my view (unavailable to Maria).

Bigscreen


Bigscreen is a space for the collective viewing of screens – as the name implies. Here, people can meet for virtual presentations, streams or watch TV shows together. Here, the point of interest is the fact that there is a mandated central point of attention (the big screen, literally), a fixed seating arrangement and the normal interactional repertoire of 6 DOF VR. In other words, it’s a space for multi-level involvement within a multi-level involvement space.

Payday 2

This is a popular cooperative bank heist game that received a VR update. We have not yet tried it out, but it may be a good candidate for a more detailed look, especially since it’s such a mainstream (read: large playerbase) title.

Honourable mentions

Below is a list of titles we have access to, but which are either too sparsely populated or too ‘non-social’ to have made it on the above list. Some of the titles on this list may end up being investigated in greater detail later. Some of these ‘non-socialities’ are quite interesting in their own right: how do spaces that were designed to be cooperative become spaces of purely individual action? I.e. how does the social sum end up equaling its individual parts?

  • Climbey
  • War Dust
  • Pavlov
  • Jet Island
  • Stand Out
  • Karnage Chronicles
  • VR Dungeon Knight

Data Anonymization in Virtual Reality

Translation: Nils Klowait

Before Nils and I could set foot in the VR research space, we faced the problem of data anonymization.

Sociologists who work with video data have developed many techniques for this: blurring or pixelating faces, modifying voices, and transforming video into graphics. However, in virtual reality, the issue of data anonymization starts to get weird. A fair number of VR applications already anonymize users to a certain extent: firstly, users do not have faces or bodies – they are replaced by virtual avatars; secondly, the person’s real name is typically not displayed – it either remains unknown or is replaced by a virtual alias. Finally, the user may not even have a voice (as was described in Nils’ previous post). The remaining identifying features in virtual reality are the voice (if present) and user actions.

Figure 1. Using graphics to present video data. Source: (Goodwin, 2017, p.229).

The presence of a basic level of anonymization in VR raises questions about the need for further anonymization and informed consent regarding the “videotaping” (i.e. the recording of the researcher’s screen, typically from a first-person perspective) of the interaction. In the field of video analysis, the problem of the impossibility of anonymization is solved by obtaining consent from the participants. For example, when Christian Heath researched auction-house interactions (Heath, 2012), he was not in a position to anonymize data: paintings worth more than two million dollars could only possibly be sold by a very small number of companies.
The impossibility of anonymization can also be purely grounded in research methodology: if you study the direction of gaze, blurring faces is undesirable for reasons relating to transcription and analysis. The most straightforward solution to this problem is obtaining consent from the participants. In cases where consent is not given or is impossible, technical workarounds may be used; for example, the interaction may be sketched by an artist.

In VR, obtaining consent is more problematic. Take cooperative games: up to 30 players can participate in them at once – would it be necessary to obtain informed consent from all people involved? This is very difficult for both technical and methodological reasons. Firstly, such games do not necessarily feature a list of players (and rarely feature contact details beyond a nickname) which could then be used to obtain per-player consent. Secondly, the greater the number of players, the more likely it is that at least one participant will refuse the use of their data. If one person in thirty disagrees, how is the issue to be resolved? Will there be a simple majority rule? Thirdly, since the VR segment is rather small, players will likely remember the virtual avatar of the researcher, which may affect the course of the game during the next round of data collection.

Figure 2. RecRoom users.

Similar problems exist in VR chat rooms, where they are further exacerbated by the fact that users constantly drop in and drop out of the space without prior notice (this issue exists in games also, but is normatively sanctioned in most cases). If you take the simpler example of dyadic interaction, as was the case in Nils’ example of Arizona Sunshine interaction, the problems do not disappear. To ask for written permission prior to the start of the game is inadvisable as it can potentially influence its course. Obtaining permission after the game is equally difficult, as the player may leave the game abruptly at any time – without leaving behind contact details.

In other words, obtaining consent for the use of data from virtual reality raises a huge number of questions. To simplify your life, you can anonymize the data to such an extent that the problem of consent arguably does not arise. However, this is also not so simple. For example, is it necessary to modify the user’s voice? If we answer yes, then this must be done in all cases – after all, this amounts to the claim that voice = identity. If we think beyond the (already complex) case of voice identity, we may ask ourselves: are actions, too, an identifying attribute? Is a particular set of actions (such as the solution to an in-game puzzle) part of what makes a user recognizable? If yes, then data from virtual reality cannot be used in principle. We return to the question I raised at the beginning of this post: where are the limits of anonymization of data for an anonymous, contingent space?

Figure 3. An Arizona Sunshine player takes off their hat.

I think research of computer games may be instructive in addressing these issues, as they share a basic level of anonymization with VR.

Heath, C. (2012). The Dynamics of Auction (Learning in Doing: Social, Cognitive and Computational Perspectives). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139024020

On the Multimodal Accomplishment of an Ambiguous Multi-step Action in Virtual Reality


The following post is intended to be the first of many methodological/experiential ruminations about researching specific VR-related phenomena. Most of the terminology employed will be adapted from key microsociological works in related areas. For a primer, see Suchman (1987) and Goodwin (2017), along with Sacks (1995).

All of the posts are meant to be highly preliminary and speculative. I feel that this is useful for people who want to join (and maybe redirect) the conversation about microsociology in virtual reality.

Gestural Work in ‘Arizona Sunshine’

One of my recent forays into virtual reality was a cooperative multiplayer-session in ‘Arizona Sunshine’, a VR-game where you navigate a post-apocalyptic zombie-infested landscape in search of supplies and survivors. Much of the game involves searching cupboards, car-trunks and (hopefully!) abandoned buildings for guns, ammunition and health kits. It also involves shooting zombies in the head, naturally.

As this is a virtual reality game, you play it by being strapped into a virtual reality headset (the Vive Pro, in our case). The Vive accurately tracks the headset in 3D space (if I move my head in real life, the same movement occurs in virtual reality, with a corresponding change of perspective). Additionally, it tracks two controllers (one for each hand) which serve as the main input mechanisms for most virtual reality games. I will describe the specifics of our lab’s VR system in a separate post.

In this session, I was joined by an unnamed second player from the United States. They seemed to be much more knowledgeable of the gameplay mechanics than I was, so I effectively had a kind-of tour guide who would point out locations of various items and instruct me to perform certain actions. On more than one occasion, the other player also saved me from being eaten by zombies – but I digress.

In all our interactions, we both had to employ the at-hand resources provided to us by a unique confluence of technological affordances and conventions. Before demonstrating this via the central example of this post, let us discuss the specific technological limitations/opportunities:

Figure 1. My first-person perspective. The large objects on both sides are the virtual guns in my hands.

As can be seen in the above screenshot, both characters (usually) have guns equipped in both hands. Things can still be grabbed with the guns in hand, but the guns themselves tended to be held at all times. Guns can be pointed in any direction and be used to achieve quite a bit of gestural work. Another mutually visible in-game element is the head-tilt. Even though the eyes are not tracked in this game, the position of the head is tracked by the VR headset. Another specificity: though the game allows players to communicate through speech, the other player’s microphone was muted in this instance. This circumstance created the necessity to ‘get by’ solely through the use of non-verbal resources.

For the above reasons, most of the interaction between me and the other player occurred through a combination of head- and hand movement, with the specific restriction that both hands were occupied by guns in most cases. Given these conditions, one could have assumed that very little happened between us, interaction-wise. In actuality, these resources were more than enough to provide a scaffolding for procedurally-generated ad-hoc conventions:

The head

Coordinated nodding was used to accomplish mutually signaled agreement or understanding. This agreement could be supported by hand gestures if necessary. The position of the head was used to maintain a mutual focus of attention on a particular area of the game world, or coordinate a shift from one area to another. The reciprocation of gaze could form a temporary participation framework, or disrupt it by the visible disattendance (‘loss of focus’) by any participant.

The ‘gun hands’

Unlike in the physical world, fingers are not tracked in Arizona Sunshine – the controllers are effectively monodirectional sticks, and are tracked as such. This limitation reduces the gestural repertoire to linear pointing: the guns can be oriented in any direction in 3D space, and be moved through it without restrictions. Thus, it was impossible to do a ‘victory sign’, a ‘thumbs up’ or a ‘middle finger gesture’ through the usual combination of individual finger arrangements. This does not, however, mean that gestures could not be developed. After all, the temporal-historical movement of sticks through space, coordinated and witnessable in situ, allows for a rather complex sequence of constellations (a conductor being an obvious example; note to self: look up research on cheerleading conventions).

Arcs could be drawn between a current head-tilt focus and an intended target. The ‘sticks’ could be waved, they could track movement, and they could be used to express agreement or disagreement. One curious convention for ‘thumbs up’ was developed in the current example: at certain points I had the need to express something like a cheer or a ‘yeah!’, so I used a rapid ‘air pumping’ movement of both hands (alternating or synchronized):

Similarly, disagreement or ‘no’ (the actual use of the gesture was, naturally, not predetermined by a convention) could be achieved by crossing the arms or waving the guns side-to-side in a synchronized fashion. At one point (sadly, the recording was corrupted), the other player performed an ‘opinion asking’ by taking two masks (one in each hand) and doing a ‘weighing’ motion, which made it possible for me to point out the mask I preferred.

Everything I’ve said so far could have been interpreted as scenarios where interaction was somehow limited. I’m not sure whether that’s a useful way to look at it: we do not consider physical interaction as ‘interaction under conditions where we don’t have four hands’. We coordinate action with whatever is within actionable grasp. If one resource is not available, perhaps there are others that can do similar work. Or perhaps the nature of the interactional work itself may be adapted to the resources. An interesting case of the latter is the use of the guns for action-at-a-distance: as was mentioned previously, the other player did not have an active microphone.

Coupled with the fact that we often were several dozen in-game meters apart, we were faced with the difficulty of gaining the attention of the other player. This issue was solved by shooting the other player in the head. The shot caused the player’s screen to turn red (in addition to the audible cue) and accomplished a shift of attention without getting closer to the player. This telecommunicative convention was particularly useful for the ‘quest’ part of the game. During certain stages, quest-relevant items had to be retrieved from a particular spot (as can be seen in the following video, a key was located in a car). As such, it was useful to have a means of ‘summoning’ the other player rather than leading them, step by step, to the specific spot. Headshots accomplished that. In short, here we have a convention that made creative use of in-game mechanics originally designed for other purposes.

Disambiguating instructions

Perhaps the most interesting case of complex co-operative action occurred in the following video. The first part shows ‘gun-based long-distance communication’, but the ‘meat’ of the video comes later.

Some background: Arizona Sunshine allows players to don masks which are scattered throughout the game world. In order to put on the mask, a player has to grab it and drag it onto their own face (mimicking the way a real-life mask would be put on). If the mask is close enough to the face when it is released, it automatically equips. Earlier in the session, I actually found (and equipped) a mask without issue. When we entered a house, the other player changed their mask, discarding the old mask onto the floor. Having forgotten that I was already wearing a mask, I decided to put on the mask the other player dropped. This is, however, not possible through the same set of movements. In order to put on the mask, I had to first unequip the previous mask by grabbing my face, dragging away the mask, dropping it, then putting on the new mask.

Watch the video to see how the other player assists me in figuring out how to put on the new mask. One of the primary difficulties is the fact that the sequence cannot be simply shown to me, as it looks similar to the process of putting on a mask for the first time. In order to disambiguate the message ‘this is how you put on a mask’ and ‘remove the old mask, then put on the new mask’, a lot of work needed to be done, particularly considering that the fact of my mask-wearing was visible to the other player but not myself.

This sequence alone could be the focus of an entire research endeavour, with particular attention being paid to the exact sequentiality of gestures, nods, uses of props and manipulations of a continuously morphing public substrate. The latter term warrants some explanation. I highly recommend reading the first chapter of Goodwin (2017) in its entirety, but here’s the gist of it:

This process of building something new through decomposition and reuse with transformation of resources placed in a public environment by an earlier actor is what I am investigating as co-operative action. As all of the materials to be examined later in this book demonstrate, it is pervasive in the organization of human action. The ability of this process to endow human history with its unique accumulative power, that is, as something progressively shaped by a consequential past, while remaining both contingent and open-ended, is captured by Merleau-Ponty’s (1962:88) observation that “history is neither a perpetual novelty, nor a perpetual repetition, but the unique movement which creates stable forms and breaks them up.” For simplicity the term substrate is used to point to the earlier utterance, or another kind of sign complex (a hopscotch grid, for example), that is the focus of transformative operations being used by another actor to create a next action.

(Goodwin, 2017, p. 3, emphasis mine)

I actually tried to do a gloss-like description of the whole procedure just now, but it turned out to be far too complex, coordinated and sequentially unfolding to be a useful exercise. What’s necessary is a detailed multimodal transcript to figure out what was visibly being done, possibly through ELAN or something more Goodwinian.

I am currently interested in gathering much more data (preferably with different players) on how at-hand resources are created, employed and modified. I am not certain that the specific set of conventions (such as the ‘yeah!’) will emerge as typical resources for most players, particularly when one considers constellations where all players can use the resource of verbal communication. I will gather more recordings and will keep you updated with additional ruminations. At the moment it seems to me that Goodwinian multimodal analysis might be a good framework for actual analytic work, especially since he has had a historic focus on (for lack of a better term) ‘limited’ interaction. It would be interesting to see whether particular kinds of interactional restrictions lead to an increased ‘mutual action-inhabiting’ and substrate-building.


Goodwin, C. (2018). Co-operative action. Cambridge University Press.

Jefferson, G. (Ed.). (1995). Harvey Sacks: lectures on conversation. Blackwell.

Suchman, L. A. (1987). Plans and situated actions: The problem of human-machine communication. Cambridge University Press.
