Annotating Web Tables with the Crowd

Ning Wang

School of Computer and Information Technology, Beijing Jiaotong University, No. 3 Shangyuancun, Haidian District, 100044 Beijing, China

Huaxi Liu

School of Computer and Information Technology, Beijing Jiaotong University, No. 3 Shangyuancun, Haidian District, 100044 Beijing, China

keywords: Crowdsourcing, semantic recovery, web tables, information integration

The Web contains a large number of structured tables, most of which lack header rows. Algorithmic approaches have been proposed to recover the semantics of web tables by annotating column labels and identifying subject columns. However, state-of-the-art techniques do not yet provide satisfactory accuracy and recall. In this paper, we present a hybrid machine-crowdsourcing framework that leverages human intelligence to improve the performance of web table annotation. In this framework, machine-based algorithms prompt human workers with candidate lists of concepts, while an improved K-means algorithm based on a novel integrative distance is proposed to minimize the number of tuples posed to the crowd. To recommend the most related tasks to human workers and to determine the final answers more accurately, an evaluation mechanism is also implemented based on Answer Credibility, which measures the probability that a worker's intuitive answer is the final answer for a task. The results of extensive experiments conducted on real-world datasets show that our framework can significantly improve annotation accuracy and time efficiency for web tables, and that our task reduction and answer evaluation mechanisms are effective and efficient in improving answer quality.

reference: Vol. 37, 2018, No. 4, pp. 969–991

doi: 10.4149/cai_2018_4_969

Computing and Informatics

formerly Computers and Artificial Intelligence
