One of the best known and most widely used data sets in information retrieval research. This data set contains Enron e-mail messages and attachments from about 150 users, mostly senior management of Enron, organized into folders. The data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation.
The data set was created by EDRM. Download by BitTorrent is recommended; download by HTTP is also available from EDRM.

For example, if you specify "de" as the language, you will get sorting that works well for German. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.
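As a sketch of such a locale-specific configuration (the type name here is illustrative, not from the original), a collated field type for Swiss German might look like:

```xml
<!-- Sketch: ICU collation tailored for German as used in Switzerland.
     The type name "collatedGermanCH" is illustrative. -->
<fieldType name="collatedGermanCH" class="solr.ICUCollationField"
           locale="de_CH"
           strength="primary"/>
```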
In the example above, we defined the strength as "primary". The strength of the collation determines how strict the sort order will be, but its effect also depends upon the language. For example, in English, "primary" strength ignores differences in case and accents. The type will be used for fields where the data contains Polish text.
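A minimal sketch of such a type and a field using it (names are illustrative assumptions) could be:

```xml
<!-- Sketch: a Polish collated type; "primary" strength ignores
     case and accent differences when sorting. -->
<fieldType name="collatedPolish" class="solr.ICUCollationField"
           locale="pl"
           strength="primary"/>

<!-- Sort fields are typically indexed but need not be stored. -->
<field name="title_sort" type="collatedPolish" indexed="true" stored="false"/>
```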
The "secondary" strength will ignore case differences, but, unlike "primary" strength, a letter with diacritics will be sorted differently from the same base letter without them. There are two approaches to supporting multiple languages: if there is a small list of languages you wish to support, consider defining a collated field for each language and using copyField. However, adding a large number of sort fields can increase disk and indexing costs.
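The per-language approach can be sketched as follows, assuming collated types such as collatedEnglish and collatedFrench have been defined elsewhere (all names here are illustrative):

```xml
<!-- Sketch: one collated sort field per supported language,
     each populated from a common source field via copyField. -->
<field name="title_sort_en" type="collatedEnglish" indexed="true" stored="false"/>
<field name="title_sort_fr" type="collatedFrench"  indexed="true" stored="false"/>

<copyField source="title" dest="title_sort_en"/>
<copyField source="title" dest="title_sort_fr"/>
```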
An alternative approach is to use the Unicode default collator. To use the default locale, simply define the locale as the empty string. This Unicode default sort is still significantly more advanced than the standard Solr sort.
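A sketch of the default-locale configuration (type name illustrative) might look like:

```xml
<!-- Sketch: the Unicode default collator, selected by leaving
     the locale attribute empty. -->
<fieldType name="collatedDefault" class="solr.ICUCollationField"
           locale=""
           strength="primary"/>
```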
You can define your own set of sorting rules. In the example below, we create a custom rule set for German called DIN 5007-2. This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file. The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country, and variant arguments instead of the combined locale argument.
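As a sketch, a tailored rules file can be referenced from the field type, and the JDK-based equivalent takes the locale as separate arguments (the file and type names below are illustrative assumptions):

```xml
<!-- Sketch: ICU collation with a custom rule set loaded from a file
     in the configset (file name "customRules.dat" is illustrative). -->
<fieldType name="collatedCustom" class="solr.ICUCollationField"
           custom="customRules.dat"
           strength="primary"/>

<!-- Sketch: JDK collation takes language/country arguments
     instead of a combined locale. -->
<fieldType name="collatedGermanJDK" class="solr.CollationField"
           language="de"
           country="DE"
           strength="primary"/>
```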
This can increase recall by causing more matches. On the other hand, it can reduce precision, because language-specific character differences may be lost. This filter converts any character in the Unicode "Decimal Number" general category (Nd) into its equivalent Basic Latin digit. In addition to these analysis components, Solr also provides an update request processor to extract named entities - see Update Processor Factories That Can Be Loaded as Plugins.
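A minimal analyzer using this digit-normalization filter might be sketched as (type name illustrative):

```xml
<!-- Sketch: normalize all Unicode decimal digits (category Nd)
     to Basic Latin 0-9 at index and query time. -->
<fieldType name="text_digits" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DecimalDigitFilterFactory"/>
  </analyzer>
</fieldType>
```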
To use the OpenNLP components, you must add additional .jars to Solr's classpath. The OpenNLP Tokenizer takes two language-specific binary model files as parameters: a sentence detector model and a tokenizer model. The last token in each sentence is flagged, so that following OpenNLP-based filters can use this information to apply operations to tokens one sentence at a time.
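A sketch of the two-model tokenizer configuration (the model file names are illustrative and must exist in the configset):

```xml
<!-- Sketch: OpenNLP tokenization with a sentence detector model
     and a tokenizer model; file names are illustrative. -->
<fieldType name="text_opennlp" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-token.bin"/>
  </analyzer>
</fieldType>
```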
See the OpenNLP website for information on downloading pre-trained models. Only index nouns. Index the phrase chunk label for each token as a synonym, after prefixing it (see the TypeAsSynonymFilter description). This filter replaces the text of each token with its lemma. Both a dictionary-based lemmatizer and a model-based lemmatizer are supported. If both are configured, the dictionary-based lemmatizer is tried first, and the model-based lemmatizer is then consulted for out-of-vocabulary tokens.
Either dictionary or lemmatizerModel must be provided, and both may be provided - see the examples below. Perform dictionary-based lemmatization, and fall back to model-based lemmatization for out-of-vocabulary tokens (see the OpenNLP Part-Of-Speech Filter section above for information about using TypeTokenFilter to avoid indexing punctuation).
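The dictionary-with-fallback configuration can be sketched as follows (all file names are illustrative assumptions):

```xml
<!-- Sketch: dictionary-based lemmatization first, with a model-based
     fallback for out-of-vocabulary tokens; file names illustrative. -->
<fieldType name="text_lemma" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-token.bin"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory"
            dictionary="lemmas.txt"
            lemmatizerModel="en-lemmatizer.bin"/>
  </analyzer>
</fieldType>
```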
Perform model-based lemmatization only, preserving the original token and emitting the lemma as a synonym (see the KeywordRepeatFilterFactory description). These factories are each designed to work with specific languages. The languages covered are listed in the sections that follow. Solr provides support for the Light-10 (PDF) stemming algorithm for Arabic, and Lucene includes an example stopword list.
This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility. Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory. There are two filters written specifically for dealing with the Bengali language.
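Because normalization and stemming are separate filters, an Arabic analyzer chain applies them in sequence; a minimal sketch (type name illustrative):

```xml
<!-- Sketch: normalize Arabic orthographic variations, then stem. -->
<fieldType name="text_ar" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
```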
They use the Lucene classes org.apache.lucene.analysis.bn.BengaliNormalizationFilter and org.apache.lucene.analysis.bn.BengaliStemFilter. Factory classes: solr.BengaliStemFilterFactory, solr.BengaliNormalizationFilterFactory. This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language.
It uses the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
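A minimal sketch of a Brazilian Portuguese analyzer using this filter (type name illustrative):

```xml
<!-- Sketch: Brazilian Portuguese stemming; note the factory takes
     no argument for a protected-word list. -->
<fieldType name="text_ptbr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
  </analyzer>
</fieldType>
```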
Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example stopword list. Solr includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.

It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.
To use this tokenizer, you must add additional .jars to Solr's classpath.
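A minimal field type using such a tokenizer might look like the sketch below; solr.HMMChineseTokenizerFactory is an assumption here, since the tokenizer's name does not survive in the text above:

```xml
<!-- Sketch: dictionary-based segmentation of Chinese words in mixed
     Chinese/non-Chinese text; the tokenizer class is an assumption. -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```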