Annotating Web Tables with the Crowd
keywords: Crowdsourcing, semantic recovery, web tables, information integration
The Web contains a large number of structured tables, most of which lack header rows. Algorithmic approaches have been proposed to recover the semantics of web tables by annotating column labels and identifying subject columns. However, state-of-the-art techniques do not yet provide satisfactory accuracy and recall. In this paper, we present a hybrid machine-crowdsourcing framework that leverages human intelligence to improve the performance of web table annotation. In this framework, machine-based algorithms are used to prompt human workers with candidate lists of concepts, while an improved K-means algorithm based on a novel integrative distance is proposed to minimize the number of tuples posed to the crowd. To recommend the most relevant tasks to human workers and determine the final answers more accurately, an evaluation mechanism is also implemented based on Answer Credibility, which measures the probability of a worker's intuitive answer being the final answer for a task. The results of extensive experiments conducted on real-world datasets show that our framework can significantly improve annotation accuracy and time efficiency for web tables, and that our task reduction and answer evaluation mechanisms are effective and efficient in improving answer quality.
reference: Vol. 37, 2018, No. 4, pp. 969–991