Same-Same But Different: On Understanding Duplicates in Stack Overflow


Stack Overflow (SO) is one of the most popular online sites for asking and answering developers’ questions. New posts that cover exactly the same knowledge as previously posted questions are closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insights into their properties and content and to understand how the community distinguishes useful redundant knowledge from useless redundant knowledge (i.e., posts to be deleted). We identified several interesting trends. Unique questions are significantly longer than others. Original questions are answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25 % of the question text and 40 % of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6 % repeated text.
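To make the overlap figures above more concrete, the following is a minimal sketch of how text and tag overlap between an original and its duplicate could be computed. It is not the measurement procedure used in the paper: the tokenizer, the normalization by the longer text, and the use of Jaccard similarity for tags are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


def text_overlap(original, duplicate):
    """Fraction of tokens shared by both texts, relative to the longer text.

    Tokens are compared as multisets, so a word repeated in both texts
    counts as many times as it appears in both.
    """
    a, b = Counter(tokenize(original)), Counter(tokenize(duplicate))
    shared = sum((a & b).values())  # multiset intersection
    longest = max(sum(a.values()), sum(b.values()))
    return shared / longest if longest else 0.0


def tag_overlap(original_tags, duplicate_tags):
    """Jaccard similarity of the two tag sets (assumed overlap measure)."""
    a, b = set(original_tags), set(duplicate_tags)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    q1 = "How to sort a list in Python"
    q2 = "Sort a Python list quickly"
    print(text_overlap(q1, q2))
    print(tag_overlap(["python", "sorting", "list"],
                      ["python", "list", "performance"]))
```

Under these assumptions, a pair whose question bodies share roughly a quarter of their tokens and roughly two fifths of their tags would fall in the range reported for duplicate pairs, although a different normalization would shift the numbers.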

You can read the full paper here.

