Semantic information plays an important role in natural language processing: describing and representing “the meanings of the words” is crucial for understanding human language.
In the last two decades, there have been efforts to create large databases that represent lexical knowledge, where words and their meanings are represented along with the connections held between them. However, in most cases, these resources are created manually. For instance, Princeton WordNet is considered the standard model of a lexical ontology for the English language. For Portuguese, there have also been some attempts to create a broad-coverage ontology, but all of them were created manually and they are not publicly available for download. Although manual creation is less prone to errors, it is time-consuming and requires a team of researchers specialised in the area.
Nevertheless, in recent years, some efforts have been made to develop computational tools that reduce the need for manual intervention; for example, some authors propose lexico-semantic patterns to find semantic relations between terms in text. This kind of approach should be considered as an alternative and a subject of research, in order to avoid impractical human work in the construction of these resources.
With this in mind, the goal of this project is the creation of a system capable of automatically acquiring semantic knowledge from any kind of Portuguese text. The extraction method is based on lexico-syntactic patterns that indicate a relation of interest, and on an inference method that extracts hypernymy relations from compound nouns. In addition, different kinds of textual resources are used to test and improve the system.
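As an illustration of the two extraction ideas mentioned above, the following is a minimal Python sketch, not the system described in this thesis. The two Portuguese patterns, the names `PATTERNS`, `extract_hypernymy` and `hypernym_from_compound`, and the rule that the first token of a compound noun is its head are all assumptions made for the example; the sketch does no lemmatisation or part-of-speech tagging.

```python
import re

# A "word" here is a run of lowercase Portuguese letters (assumed tokenisation).
WORD = r"[a-záàâãéêíóôõúç]+"

# Illustrative lexico-syntactic patterns (hypothetical, not the thesis's own set).
PATTERNS = [
    # "<hyponym> é um tipo de <hypernym>" / "<hyponym> é uma espécie de <hypernym>"
    re.compile(rf"\b(?P<hypo>{WORD}) é uma? (?:tipo|espécie) de (?P<hyper>{WORD})"),
    # "<hypernym> tais como <hyponym>, <hyponym> e <hyponym>"
    re.compile(rf"\b(?P<hyper>{WORD}) tais como (?P<hypo>{WORD}(?:, {WORD})*(?: e {WORD})?)"),
]


def extract_hypernymy(text):
    """Return (hyponym, hypernym) pairs matched by the patterns above."""
    text = text.lower()
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            # An enumeration such as "maçã, pera e uva" yields one pair per item.
            for hyponym in re.split(r", | e ", match.group("hypo")):
                pairs.append((hyponym, match.group("hyper")))
    return pairs


def hypernym_from_compound(term):
    """Infer (compound, head) from a multiword term, assuming the head comes first.

    For example, "casa de campo" is taken to be a kind of "casa".
    Single-word terms yield None.
    """
    tokens = term.lower().split()
    return (" ".join(tokens), tokens[0]) if len(tokens) > 1 else None


if __name__ == "__main__":
    print(extract_hypernymy("O cão é um tipo de animal."))
    print(extract_hypernymy("Frutas tais como maçã, pera e uva."))
    print(hypernym_from_compound("casa de campo"))
```

Surface patterns of this kind are cheap to apply to any text, but they match strings rather than syntactic structures, so a realistic system would need a larger pattern set and some linguistic preprocessing.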
Furthermore, this work analyses the benefits of applying distributional similarity metrics, based on the occurrence of words in documents, to the outputs of our system. The quality and utility of the knowledge extracted from the various textual resources are compared against another Portuguese knowledge base. At the end of this research, important contributions to the computational processing of the Portuguese language are provided, such as computational tools capable of extracting and inferring lexico-semantic information from text, and methodologies to automatically validate this knowledge and to compare knowledge bases. Finally, the outcomes and conclusions of the experiments are published in important conferences.
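To make the idea of similarity based on the occurrence of words in documents concrete, here is a small Python sketch. It is only an assumption about what such metrics can look like: the abstract does not say which measures are used, and the functions `document_sets`, `jaccard` and `cosine`, as well as the whitespace tokenisation, are hypothetical choices made for the example.

```python
import math


def document_sets(documents):
    """Map each word to the set of indices of the documents in which it occurs."""
    occurrences = {}
    for doc_id, doc in enumerate(documents):
        for word in set(doc.lower().split()):
            occurrences.setdefault(word, set()).add(doc_id)
    return occurrences


def jaccard(a, b, occurrences):
    """Shared documents divided by documents containing either word."""
    docs_a = occurrences.get(a, set())
    docs_b = occurrences.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def cosine(a, b, occurrences):
    """Cosine between the binary document-occurrence vectors of two words."""
    docs_a = occurrences.get(a, set())
    docs_b = occurrences.get(b, set())
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))


if __name__ == "__main__":
    docs = ["o cão ladra", "o cão é um animal", "o gato é um animal"]
    occ = document_sets(docs)
    print(jaccard("cão", "animal", occ), cosine("cão", "animal", occ))
```

Under this sketch, an extracted pair of terms whose score falls below some chosen threshold could be flagged for review, which is one way such a metric might support automatic validation.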
Keywords: information extraction, information retrieval, lexical ontologies, lexico-syntactic patterns, semantic knowledge, semantic relations.


Natural Language Processing

MSc Thesis

Automatic Extraction and Validation of Lexical Ontologies from text, September 2010

