Extracting structured information from text to augment knowledge bases

SILVA, Johny Moreira da

Use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/34145

Title:	Extracting structured information from text to augment knowledge bases
Author(s):	SILVA, Johny Moreira da
Keywords:	Databases; Natural language processing
Document date:	25-Feb-2019
Publisher:	Universidade Federal de Pernambuco
Abstract:	Knowledge graphs (or knowledge bases) allow data organization and exploration, making it easier for machines to understand and use data semantically. Traditional strategies for knowledge base construction have mostly relied on manual effort or on automatic extraction from structured and semi-structured data. Considering the large amount of unstructured information on the Web, new approaches to knowledge base construction and maintenance are trying to leverage this information to improve the quality and coverage of knowledge graphs. In this work, focusing on the completeness problem of existing knowledge bases, we are interested in extracting missing attributes of knowledge base entities from unstructured text. For this study, in particular, we use the infoboxes of entities in Wikipedia articles as instances of the knowledge graph and their respective text as the source of unstructured data. More specifically, given Wikipedia articles of entities in a particular domain, the structured information about the entity's attributes in the infobox is used by a distant supervision strategy to identify sentences that mention those attributes in the text. These sentences are provided as labeled examples to train a sequence-based neural network (Bidirectional Long Short-Term Memory or Convolutional Neural Network), which then performs the extraction of the attributes on unseen articles. We have compared our strategy with two traditional approaches for this problem, Kylin and iPopulator. Our distant supervision model has produced a considerable number of positive and negative training examples, and these examples are representative when compared with those of the other two systems. Also, our extraction pipeline has shown better performance in filling the proposed schema. Overall, the extraction pipeline proposed in this work outperforms the baseline models with an average increase of 0.29 points in F-Score, a significant difference in performance.
In this work we have proposed a modification of the Distant Supervision paradigm for automatic labeling of training examples and an extraction pipeline for filling out a given schema with better performance than the analyzed baseline systems.
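The distant supervision step described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: it assumes exact, case-insensitive matching of infobox values against sentence tokens and a BIO tagging scheme, and the names `tokenize`, `label_sentence`, `label_article`, and the sample infobox and sentences are all hypothetical, introduced only for illustration.

```python
import re

def tokenize(text):
    # Lowercased word/number tokens; punctuation is dropped (a simplifying assumption).
    return re.findall(r"\w+", text.lower())

def label_sentence(sentence, infobox):
    """Return (tokens, BIO tags) for one sentence using exact value matching."""
    tokens = tokenize(sentence)
    tags = ["O"] * len(tokens)
    for attribute, value in infobox.items():
        value_tokens = tokenize(value)
        n = len(value_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == value_tokens and span_is_free:
                tags[i] = "B-" + attribute
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + attribute
    return tokens, tags

def label_article(sentences, infobox):
    """Split sentences into positive (some attribute mentioned) and negative examples."""
    positives, negatives = [], []
    for sentence in sentences:
        tokens, tags = label_sentence(sentence, infobox)
        bucket = positives if any(t != "O" for t in tags) else negatives
        bucket.append((tokens, tags))
    return positives, negatives

# Hypothetical infobox and article text, for illustration only.
infobox = {"birth_place": "Recife", "occupation": "computer scientist"}
sentences = [
    "She was born in Recife in 1950.",
    "She worked as a computer scientist.",
    "Her work was widely read.",
]
positives, negatives = label_article(sentences, infobox)
```

In this sketch the first two sentences become positive examples with token-level tags, and the third becomes a negative example. The tagged sequences are the kind of input a sequence model such as a BiLSTM could be trained on. Exact matching is the simplest possible heuristic, and the abstract states that the work proposes a modification of the distant supervision paradigm, so the actual labeling rules likely differ.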
URI:	https://repositorio.ufpe.br/handle/123456789/34145
Appears in collections:	Master's Dissertations - Computer Science

Files associated with this item:

File	Description	Size	Format
DISSERTAÇÃO Johny Moreira da Silva.pdf		4.02 MB	Adobe PDF

This file is protected by copyright

This item is licensed under a Creative Commons License