Date of Award

12-2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Boumediene Belkhouche

Second Advisor

Salah Bouktif

Third Advisor

Abderrahmane Lakas

Abstract

We developed techniques for detecting local text reuse on the Web, with an emphasis on the Arabic language. Our objective is to develop text reuse detection methods that can identify alternative versions of the same information, and to explore the feasibility of applying such methods on the Web. The results of this research can serve as tools for information analysts in corporate and intelligence applications. Such tools will become essential for validating and assessing information of uncertain origin, and they will also prove useful for detecting reuse in scientific literature. Ordinary Web users, too, can become Fact Inspectors: a tool that lets people quickly check the validity and originality of statements and their sources gives them the opportunity to perform their own assessment of information quality.

Local text reuse detection can be divided into two major subtasks: first, retrieving from a collection the candidate documents that are likely to be the original sources of a given document; and second, performing an extensive pairwise comparison between the given document and each retrieved candidate source. Given an input document d, the problem of local text reuse detection is to find, in a given document collection, all reused passages between d and the other documents. Comparing the passages of d with the passages of every other document in the collection is infeasible, especially for large collections such as the Web. Selecting a subset of documents that potentially share reused text with d therefore becomes a major step in the detection problem. On the Web, the search for such candidate source documents is usually performed through a limited query interface. To address this challenging problem, we developed a new, efficient query formulation approach for retrieving Arabic candidate source documents from the Web. The candidate documents are then fed to a local text reuse detection system for detailed similarity evaluation against d. We consider candidate source document retrieval an essential step in the detection of text reuse.
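The two subtasks described above can be illustrated with a minimal sketch. It is not the dissertation's method: the rarest-terms query heuristic, the word n-gram comparison, and every name below (formulate_queries, retrieve_candidates, shared_passages, and the search callable standing in for a Web query interface) are assumptions made for illustration, and Arabic-specific normalization is omitted.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) found in text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def formulate_queries(document, chunk_size=50, terms_per_query=5):
    """Hypothetical query formulation: for each fixed-size chunk of the
    document, use the terms that are rarest within the document as a query."""
    words = document.split()
    freq = Counter(words)
    queries = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # Unique terms of the chunk, in order of first occurrence.
        unique = list(dict.fromkeys(chunk))
        # Stable sort by ascending frequency: rare terms come first.
        unique.sort(key=lambda w: freq[w])
        queries.append(" ".join(unique[:terms_per_query]))
    return queries


def retrieve_candidates(queries, search):
    """Subtask 1: send each query through `search` (an assumed stand-in
    for a limited Web query interface) and collect distinct results."""
    candidates = []
    for query in queries:
        for doc_id in search(query):
            if doc_id not in candidates:
                candidates.append(doc_id)
    return candidates


def shared_passages(d, source, n=3):
    """Subtask 2: word n-grams that d shares with one candidate source."""
    return word_ngrams(d, n) & word_ngrams(source, n)
```

Only the documents returned by retrieve_candidates are passed to shared_passages, so the expensive pairwise comparison runs on a small subset of the collection rather than on all of it.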

Several techniques have previously been proposed for detecting text reuse; however, these techniques were designed for relatively small and homogeneous collections. Furthermore, we are not aware of any previous work on Arabic text reuse detection on the Web. This is due to the complexity of the Arabic language, as well as the heterogeneity and large scale of the information on the Web, which make text reuse detection on the Web much more difficult than in relatively small and homogeneous collections. We evaluated our work using a collection of documents constructed and downloaded from the Web specifically for evaluating Web document retrieval in particular and detailed text reuse detection in general. Our work is to a certain degree exploratory rather than definitive, in that this problem has not previously been investigated for Arabic documents at Web scale. Nevertheless, our results show that the methods we describe are applicable to Arabic reuse detection in practice. The experiments show that around 80% of the Web documents used in the reuse cases were successfully retrieved. In the detailed similarity analysis, the system achieved an overall score of 97.2% based on the precision and recall evaluation metrics.
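The abstract does not say how precision and recall are combined into the overall score. As a reminder of what the two metrics measure, the sketch below computes them over sets of detected and actual reuse cases, together with their harmonic mean (F1) as one common way of combining them. The function names and the choice of F1 are assumptions, not the evaluation procedure used in the dissertation.

```python
def precision_recall(detected, actual):
    """Precision: fraction of detected items that are truly reused.
    Recall: fraction of truly reused items that were detected."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed combination)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```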
