Enhancing Loosely Schema-aware Entity Resolution with User Interaction

Luca Gagliardelli
2018-01-01

Abstract

Entity Resolution (ER) is a fundamental task of data integration: it identifies different representations (i.e., profiles) of the same real-world entity in databases. Comparing all possible profile pairs with an ER algorithm has quadratic complexity. Blocking is commonly employed to avoid this: profiles are grouped into blocks according to some features, and ER is performed only among profiles of the same block. Yet, devising blocking criteria and ER algorithms for data with high schema heterogeneity is a difficult and error-prone task that calls for automatic methods and debugging tools. In our previous work, we presented Blast, an ER system that can scale practitioners’ favorite Entity Resolution algorithms. In its current version, Blast has also been designed to take full advantage of parallel and distributed computation (running on top of Apache Spark). It implements the state-of-the-art unsupervised blocking method based on automatically extracted loose schema information. It comes with a GUI that allows users: (i) to visualize, understand, and (optionally) manually modify the automatically extracted loose schema information (i.e., to inject the user’s knowledge into the system); and (ii) to retrieve resolved entities through a free-text search box and to visualize the process that led to that result (i.e., the provenance). Experimental results on real-world datasets show that these two functionalities can significantly enhance Entity Resolution results.
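
The blocking idea summarized in the abstract can be illustrated with a minimal sketch. It assumes plain schema-agnostic token blocking, which is a simpler stand-in and not Blast's loosely schema-aware method; the function names `token_blocking` and `candidate_pairs` and the toy profiles are hypothetical and serve only to show how blocking reduces the number of comparisons.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Group profile ids by every token appearing in any attribute value."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for value in profile.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block holding a single profile yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Return the distinct profile pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy profiles with heterogeneous attribute names.
profiles = {
    1: {"name": "John Smith", "city": "Modena"},
    2: {"full_name": "Smith John", "location": "Modena"},
    3: {"title": "Alice Rossi", "town": "Bologna"},
    4: {"name": "A. Rossi", "addr": "Bologna"},
}

all_pairs = len(profiles) * (len(profiles) - 1) // 2  # quadratic baseline
blocks = token_blocking(profiles)
pairs = candidate_pairs(blocks)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Because the tokens are taken from attribute values and the attribute names are ignored, profiles with different schemas can still land in the same block. Only the pairs sharing a block would then be passed to the actual ER algorithm.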
2018
ISBN: 9781538678787

Use this identifier to cite or link to this document: https://hdl.handle.net/11389/69840