Workshop Example: Data Sprints


A crucial element of the Data Inquiry approach is the collaboration between the expert/apprentice data scientists and the actors engaged in actual societal situations. This collaboration helps the data scientists to consider not only the technical dimension of their intervention, but also the context in which it takes place and the conditions that can make it more than a simple exercise of technical skill. Civil society groups, for their part, will find not only help in the collection and treatment of data, but more importantly a fresh perspective on how their action can exploit digital records.

In the first phases of a project, the collaboration between data scientists and civil society organisations can happen remotely and asynchronously, through an exchange of emails that initiates a dialogue and converges on a shared definition of a possible joint project. Yet experience has taught us that the best format for carrying out the bulk of the collaboration is an intensive, time-boxed workshop, where the two groups work together (if possible, in the physical presence of one another) for at least 2 or 3 days and, ideally, for an entire week.

Over the years, the researchers of the Public Data Lab have developed a specific template for this kind of workshopping, learning from the open-source format of barcamps and datathons, and adapting it to the specificities of academic research and teaching.

Data sprints import two things from their open-source predecessors:

The ‘quick and dirty’ (or ‘design to cost’) approach. The short and intensive nature of data sprints shields these events from the dream of exhaustivity sometimes associated with ‘big data’. Participants know that they will only be able to treat a limited quantity of records and that they will only achieve imperfect results, but they accept such constraints more as a challenge than as a flaw. Making the most of light infrastructures, simple logistics and agile organization, participants are aware that their work should reuse code and data gathered in earlier projects and that their outcomes should become the basis for further ventures.
The heterogeneity of the actors involved. The need to achieve deliverable results by the end of the event requires gathering all the necessary competences, both in terms of technical skills and in terms of knowledge of the social situation at stake (hence the importance of convening all the participants during the sprint).

Unlike hackathons and barcamps, however, data sprints are always preceded by extensive preparation. Because the time available during a data sprint is limited, it is crucial to carry out some activities beforehand:

Posing the intervention objectives. While identification of the specific goals of the intervention should be carried out before the sprint itself, it is important that these objectives are not just imposed by the civil society group initiating the intervention but discussed and co-defined among all the participants of the sprint.
Operationalizing the intervention objectives into feasible projects. We found that an excellent way of doing this initial vetting is to have a collective discussion about which existing datasets could be accessed and exploited for the intervention. This gives all participants a chance to attune to what the data project can and cannot achieve.
Procuring and preparing datasets. If relevant datasets are identified beforehand, their harvesting and cleaning should be carried out before the sprint, as these are generally time-consuming operations. If no suitable dataset exists, then this initial data collection can itself become the objective of the sprint.

Careful preparation makes it possible to dedicate as much of the workshop time as possible to the activities that can only be carried out during the data sprint:

Writing and adapting scripts. Since the focus of the sprint is the preparation of an actual societal intervention, designing new technical solutions is less important than adapting existing code to the goals of the intervention.
Designing data visualizations and interfaces. One of the driving forces of data sprints is that they deliver tangible outcomes. These outcomes may have different forms, but they always share the characteristic of being usable by societal actors. Often, this results in the civil society groups leaving the sprints with tangible results that they immediately mobilize in their actions. More generally, this means that specific efforts should be invested to design the outcomes of the sprint so that they are relevant not only for the data scientists, but also for their potential users.
Considering and managing societal implications. This last mode of engagement is necessary to create a space for researchers and actors to experiment with new collective arrangements, or hybrid forums as we called them above. The six activities described above should all be arranged so that this type of engagement can be achieved. If sprints fail to create a common space for the co-production of knowledge between social scientists and social actors, they fail in all other respects as well.

Finally, data sprints require more follow-up than hackathons and barcamps. The ‘quick and dirty’ approach that characterizes the sprinting days should be complemented by extensive refinement and documentation, in order to make sure that the work of the sprint actually bears fruit and generates the desired societal outcomes. Besides following up on the specific objectives of the sprint, efforts should be invested in making datasets, scripts and visualizations reusable beyond their original projects. Sprints should remain faithful to their open-source roots and ensure that all the data, code and content produced are freely available through open licenses.

A more detailed description of the data sprint format can be found here