Language-Specific and Multilingual Full-Text Searching

Sun, 18 Sep, 14:00 – Room 1 „Microsoft“
Presentation in English, Intended for: Site builders
Speaker/Host:
mkalkbrenner

Apache Solr Search Integration provides a (more or less) easy way to use Apache Solr as a powerful search engine for Drupal. Unfortunately, the only language that works well with it out of the box is English.

So if you run a non-English website, you need to tweak all the configuration files by hand or you lose some of the advantages that Solr provides compared to Drupal's built-in database-driven search. Doing so requires a deep knowledge of Solr and search technology in general.

The entire process gets even more complicated if you run a multilingual website.

Apache Solr Multilingual hides most of the complexity from a Drupal website's administrator.

Nevertheless, you need a basic knowledge of full-text searching and an understanding of language-specific problems:

  • Stop Words
    Words you want to exclude from your search index are called stop words. The list of words strongly depends on the focus of your website and, of course, on your site's language.
  • Stemming
    Every word in the search index is stored in a reduced form called a word stem. This strategy enables the user to find content independent of the keyword's inflection, e.g. singular or plural. Unfortunately, the stemming algorithm differs from language to language.
  • Protected Words
    In some cases, you'll want to exclude certain words from the stemming described above. Like stop words, these protected words are language-specific.
  • Compound Word Splitting
    Languages like German frequently combine words into compounds (e.g. "Dampfschifffahrt"). To make the individual parts searchable, you need to split such words using language-specific word catalogs.
  • Spell Checking
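    Spelling suggestions are only useful if they are drawn from the vocabulary of the language being searched, so spell checking has to be language-specific as well.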
  • ...
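
To make the first four concepts concrete, here is a minimal Python sketch of a language-aware analysis chain. It is a toy illustration only, not how Apache Solr or Apache Solr Multilingual implements these steps: the word lists, the suffix-stripping "stemmer", the processing order and all function names (stem, split_compound, analyze) are assumptions invented for this example. Spell checking is left out.

```python
import re

# Tiny per-language word lists; real lists are far larger (assumed sample data).
STOP_WORDS = {"en": {"the", "a", "of", "and"}, "de": {"der", "die", "das", "und"}}
PROTECTED_WORDS = {"en": {"news"}, "de": set()}
SUFFIXES = {"en": ("ies", "es", "s"), "de": ("en", "er", "e", "n", "s")}
COMPOUND_PARTS = {"de": {"dampf", "schiff", "fahrt"}}


def stem(word, lang):
    """Naive suffix stripping; protected words are returned unchanged."""
    if word in PROTECTED_WORDS.get(lang, set()):
        return word
    for suffix in SUFFIXES.get(lang, ()):
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_compound(word, lang):
    """Greedy longest-match split against a word catalog.

    Returns [word] unchanged if it cannot be fully decomposed.
    """
    catalog = COMPOUND_PARTS.get(lang, set())
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in catalog:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return [word]
    return parts


def analyze(text, lang):
    """Lowercase, tokenize, drop stop words, split compounds, then stem."""
    terms = []
    for token in re.findall(r"\w+", text.lower()):
        if token in STOP_WORDS.get(lang, set()):
            continue
        for part in split_compound(token, lang):
            terms.append(stem(part, lang))
    return terms


print(analyze("The ships of the news", "en"))  # ['ship', 'news']
print(analyze("Die Dampfschifffahrt", "de"))   # ['dampf', 'schiff', 'fahrt']
```

Note that without the protected-word entry, "news" would be stemmed to "new" by this toy stemmer, which is exactly the kind of unwanted match that protected words prevent.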

This session is not about installing Apache Solr and connecting it to Drupal. Instead, I will explain the language-specific problems and how Apache Solr Multilingual helps to solve them for Drupal.

The background information and explanations will be helpful for other search engines as well.