
Indexing.pl

From EPrints Documentation
Latest revision as of 10:21, 30 January 2022



indexing.pl contains configuration for indexing data objects.

In particular, it configures whether indexing is enabled and, if so, the following settings:

  • $c->{indexing}->{freetext_min_word_size} - The minimum length a word in a free-text field must have to be indexed. The default is 3.
  • $c->{indexing}->{freetext_stop_words} - Words that should not be indexed in free-text fields because they are too common (e.g. and, are, the, you, etc.).
  • $c->{indexing}->{freetext_seperator_chars} - Characters that separate words in a free-text field (e.g. colon :, equals =, hyphen -, full stop ., space, etc.). N.B. "seperator" is a misspelling in the codebase that cannot now be corrected for legacy reasons.
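
As an illustration, the three settings listed above might be set in indexing.pl along these lines. This is only a sketch: the page does not state the data structure for the stop words or separator characters, so the hash references keyed by word or character are an assumption, and the values are taken from the examples given above rather than from the shipped defaults.

```perl
# Minimum word length for free-text indexing (3 is the documented default).
$c->{indexing}->{freetext_min_word_size} = 3;

# ASSUMPTION: stop words are held as a hash reference keyed by the word.
$c->{indexing}->{freetext_stop_words} = {
	"and" => 1,
	"are" => 1,
	"the" => 1,
	"you" => 1,
};

# ASSUMPTION: separator characters are held the same way.
# Note the legacy spelling "seperator" in the key name.
$c->{indexing}->{freetext_seperator_chars} = {
	':' => 1,
	'=' => 1,
	'-' => 1,
	'.' => 1,
	' ' => 1,
};
```

Check the indexing.pl shipped with your EPrints version for the actual structure and default values before copying anything from this fragment.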

The file also contains the extract_words function, which defines how individual words are extracted from free text. This may vary across different types of repository, and some repositories have edge cases they need to handle, so it has been purposefully designed as a user-defined function to accommodate bespoke requirements.
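
To give a feel for what such a function has to do, here is a minimal standalone sketch in Perl. It is not the extract_words function shipped with EPrints: the name extract_words_sketch, the %config hash and its keys, and the choice to return kept and skipped words as two lists are all hypothetical, chosen only so that the example runs on its own without EPrints.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the three indexing settings described above.
my %config = (
	min_word_size   => 3,
	stop_words      => { map { $_ => 1 } qw( and are the you ) },
	separator_chars => { map { $_ => 1 } ( ':', '=', '-', '.', ' ' ) },
);

# Split free text into words to index and words to skip.
# Assumes at least one separator character is configured.
sub extract_words_sketch
{
	my( $text, $cfg ) = @_;

	# Build a regex character class from the configured separator characters.
	my $class = join( '', map { quotemeta( $_ ) } sort keys %{ $cfg->{separator_chars} } );

	# Lower-case the text, split on runs of separators, drop empty fields.
	my @words = grep { length( $_ ) } split( /[$class]+/, lc( $text ) );

	my( @good, @bad );
	foreach my $word ( @words )
	{
		# Skip words that are too short or that are stop words.
		if( length( $word ) < $cfg->{min_word_size} || $cfg->{stop_words}->{$word} )
		{
			push @bad, $word;
		}
		else
		{
			push @good, $word;
		}
	}

	return( \@good, \@bad );
}

my( $good, $bad ) = extract_words_sketch( "Open-access: the repository and you", \%config );
print join( ' ', @$good ), "\n";   # prints "open access repository"
```

A repository with edge cases, such as acronyms written with full stops or identifiers containing hyphens, would handle them before the split step. That kind of local adjustment is what a user-defined function makes possible.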