The Construction of String Similarity Predicates
Abstract: Revision with unchanged content. In an age of globalisation, access to useful information is becoming increasingly important. Like genetic engineering, the expansion of the Internet produces similarly large volumes of data, frequently stored in text files. One of the most relevant points of intersection between the two is the use of approximate string matching on large text data. The Internet faces the challenge of not only concentrating on request times but also finding more context-relevant information. To this end, further work in this field has to take into account that documents can contain spelling mistakes or abbreviated words. Other pieces of information are replaced by their acronyms, or are less important and can be ignored. All of these tasks fall within the field of computational linguistics. This master's thesis shows, step by step, the tokenising of real text, the homogenisation of words, and their storage in a specific index structure for subsequent approximate string matching, taking secondary storage into account. A prototype programmed in Java completes the work.
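The pipeline named in the abstract (tokenising, homogenisation, indexing, approximate matching) can be illustrated with a minimal Java sketch. It is not the thesis prototype: the class name SimilaritySketch and all method names are hypothetical, homogenisation is assumed to be plain lower-casing, the index is an in-memory map rather than a structure designed for secondary storage, and Levenshtein edit distance is assumed as the similarity measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names and design are assumptions, not the thesis code.
final class SimilaritySketch {

    // Tokenising: split raw text on every run of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Homogenisation: assumed here to be lower-casing only.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Index: maps each normalised word to the token positions where it occurs.
    // Kept in memory for brevity; secondary storage is not modelled.
    static Map<String, List<Integer>> buildIndex(String text) {
        Map<String, List<Integer>> index = new TreeMap<>();
        List<String> tokens = tokenize(text);
        for (int i = 0; i < tokens.size(); i++) {
            index.computeIfAbsent(normalize(tokens.get(i)), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // Levenshtein edit distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity predicate: true if the normalised strings are within maxDistance edits.
    static boolean similar(String a, String b, int maxDistance) {
        return levenshtein(normalize(a), normalize(b)) <= maxDistance;
    }

    // Approximate lookup by scanning all index keys (a linear scan, chosen for clarity).
    static List<String> search(Map<String, List<Integer>> index, String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String word : index.keySet()) {
            if (similar(word, query, maxDistance)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                buildIndex("Approximate matching tolerates mistakes; aproximate spellings still match.");
        System.out.println(search(index, "approximate", 1));
    }
}
```

The linear scan over the index keys keeps the sketch short. An index intended for large text collections would normally prune candidates before computing any edit distance.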