Once upon a time, I had a job that included looking through boxes of documents that were supposedly related to environmental litigation, but were generally (a) unrelated, (b) dusty and (c) mind-numbingly dull. Earlier this year, as I looked back on those dark days, it seemed to me that crowdsourcing would be a great tool for a first pass through documents, helping a legal team focus its efforts away from documents that are obviously not responsive to a given request.
To test this suspicion, we used a dataset of ~2,700 documents that had been pre-coded with relevance assessments by a team of legal experts associated with the TREC 2010 Legal Learning task.1 The documents were emails made public during the course of the Enron investigation; we asked multiple workers whether each one was responsive to a request for Residential Real Estate documents (full instructions here):
“So a Team of Lawyers walks into a Conference Room…”
We ran two iterations of the document review task. In the first, we used a Gold 2 distribution of 50% Relevant and 50% Not Relevant. For the second iteration, we increased the proportion of Relevant documents in our Gold set to 60%. Our thinking was that, at least in the context of litigation, returning an irrelevant document (false positive) was preferable to missing a Relevant document (false negative).
In the chart above, Recall is the percentage of all responsive documents that were returned; it measures how thorough the search is. Precision is the percentage of returned documents that are responsive; it measures how accurate the process is. F1 is the harmonic mean of Precision and Recall, a simple summary measure that rewards high values in both. We used a simple majority among workers to determine a document’s relevance.
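The three metrics above can be sketched in a few lines. This is a minimal illustration with hypothetical confusion counts (the function name and the example numbers are ours, not figures from the test):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the metrics described above from confusion counts.

    tp: responsive documents correctly returned (true positives)
    fp: non-responsive documents returned anyway (false positives)
    fn: responsive documents that were missed (false negatives)
    """
    precision = tp / (tp + fp)          # how accurate the returned set is
    recall = tp / (tp + fn)             # how thorough the search is
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical example: 80 responsive docs found, 20 false alarms, 20 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Note that because F1 is a harmonic mean, it is dragged down sharply by whichever of the two values is lower.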
Before looking at how this performance compares with manual and automated document review, it’s worth noting that changing the Gold distribution had very little effect on the overall accuracy, but it had a large effect on the distribution of errors. By increasing the frequency of Relevant Gold units, we cut the number of false negative errors by nearly 50%. In a context where one type of error is relatively more “expensive” than another, this is a useful tool to be aware of.
Without running the same dataset through crowdsourced, automated, and manual document review, it’s difficult to compare performance across methods. Nevertheless, Grossman and Cormack (2011) discuss manual and automated document review, finding that average recall for manual review can be as low as 20-50%, though typically with much higher precision. For automated review on a dataset similar to the one we used, recall averaged 77%, with average precision of 85%.3
Living in a (non)Binary World
Every document in our test was also graded, by default, with a probabilistic measure of Relevance. Because we asked multiple reviewers whether a given document was Relevant, we used inter-coder agreement to suggest the likelihood that a document is responsive. Further, because we tracked each individual worker’s performance on Gold, we weighted each worker’s contribution to the agreement by that worker’s estimated accuracy.
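The idea of accuracy-weighted agreement can be sketched as follows. This is a simplified illustration, not our production scoring logic; the function name and the example accuracies are assumptions for the sake of the example:

```python
def weighted_relevance(judgments):
    """Accuracy-weighted share of 'Relevant' votes for one document.

    judgments: list of (voted_relevant: bool, worker_accuracy: float),
    where worker_accuracy is the worker's estimated accuracy on Gold.
    Returns a score in [0, 1]; higher means more likely Relevant.
    """
    total_weight = sum(acc for _, acc in judgments)
    relevant_weight = sum(acc for vote, acc in judgments if vote)
    return relevant_weight / total_weight

# Three workers split 2/1, but the lone dissenter has a stronger Gold
# record, so the weighted score falls below the raw 2/3 majority share.
score = weighted_relevance([(True, 0.70), (True, 0.70), (False, 0.95)])
```

Under a raw majority vote this document would count as Relevant; the weighted score preserves the disagreement instead of discarding it.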
For this exercise, we remapped our confidence scores such that a document that was the least likely to be Relevant received a Relevance Score of 0.01, while the documents most likely to be Relevant received a Relevance Score of 0.99.4 The distribution of documents by Relevance Score is included below.
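The remapping in footnote 4 is straightforward to express in code. A minimal sketch, with a function name of our choosing:

```python
def relevance_score(voted_relevant, confidence):
    """Map a majority label plus its confidence into a Relevance Score.

    Per footnote 4: for 'Relevant' documents the score is
    0.99 * Confidence; for 'Not Relevant' it is 1 - 0.99 * Confidence.
    A fully confident 'Relevant' maps to 0.99, a fully confident
    'Not Relevant' maps to 0.01, keeping scores inside (0, 1).
    """
    if voted_relevant:
        return 0.99 * confidence
    return 1 - 0.99 * confidence
```

Keeping the scores strictly inside (0, 1) reflects that no judgment, however unanimous, is treated as certain.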
Note that because most documents in this test received judgments from three different workers, there isn’t much variation in the middle of the distribution. Most units had either unanimous agreement or a 2/1 split. Nevertheless, the Relevance Score makes it possible to set a threshold on what should be considered Relevant.
By changing the threshold, we can include any document that received at least one judgment of Relevant (increasing Recall) or include only documents that received no judgment of Not Relevant (increasing Precision). As shown below, different thresholds dramatically influence the number and characteristics of the documents returned as Relevant.
While there is no substitute for trained legal experts, these results show that crowdsourcing is an effective complement to eDiscovery document review. The promise of putting multiple pairs of eyes on every document dramatically decreases the likelihood of missing a relevant document. And consider that in less than 24 hours, we collected over 15,000 unique relevance judgments on nearly 3,000 documents, and for much less than the billing rate of your average attorney.
2 One of the ways that we control for quality is by randomly inserting a subset of units for which we already know the answers. We refer to this data as Gold. We track worker performance on these Gold units as a proxy for overall accuracy. Additional documentation is here.
3 Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf
4 For “Relevant” documents, P(Relevance) = 0.99 × Confidence. For “Not Relevant” documents, P(Relevance) = 1 − 0.99 × Confidence.