
Indexing.pl

From EPrints Documentation



This file contains configuration for indexing data objects.

In particular, it controls whether indexing is enabled and, if so, the following settings:

  • $c->{indexing}->{freetext_min_word_size} - The minimum length a word in a free-text field must have to be indexed. The default is 3.
  • $c->{indexing}->{freetext_stop_words} - Words that should not be indexed in free-text fields because they are too common (e.g. and, are, the, you, etc.).
  • $c->{indexing}->{freetext_seperator_chars} - Characters that separate words in a free-text field (e.g. colon :, equals =, hyphen -, full stop ., space, etc.). N.B. The spelling "seperator" is a typo in the codebase that cannot now be fixed for legacy reasons.
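
As a rough illustration, the three settings above might be set along the following lines. Only the key names and the default of 3 come from this page. The layout of the stop-word and separator values (hash references whose keys are flagged with 1) and the specific entries shown are assumptions for the sake of the example, so check the indexing.pl shipped with your EPrints version before copying anything.

```perl
# Minimum length a free-text word must have to be indexed (documented default).
$c->{indexing}->{freetext_min_word_size} = 3;

# ASSUMPTION: stop words held as a hash reference, word => 1.
# The entries are taken from the examples given on this page.
$c->{indexing}->{freetext_stop_words} = {
	"and" => 1,
	"are" => 1,
	"the" => 1,
	"you" => 1,
};

# ASSUMPTION: separator characters held the same way, character => 1.
# Note the legacy spelling "seperator" in the key name.
$c->{indexing}->{freetext_seperator_chars} = {
	":" => 1,
	"=" => 1,
	"-" => 1,
	"." => 1,
	" " => 1,
};
```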

The file also contains the extract_words function, which defines how individual words are extracted from free text. This may vary across different types of repository, and some repositories may have edge cases they need to handle, so it has been purposefully designed as a user-defined function to accommodate bespoke requirements.
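
To make the idea concrete, here is a minimal, self-contained sketch of how word extraction could use the three settings. It is not the extract_words function shipped with EPrints: the name extract_words_sketch, its arguments, and the local $indexing hash are all hypothetical, and the real function's signature and behaviour may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the $c->{indexing} settings (values are examples).
my $indexing = {
	freetext_min_word_size   => 3,
	freetext_stop_words      => { map { $_ => 1 } qw( and are the you ) },
	freetext_seperator_chars => { map { $_ => 1 } ( ':', '=', '-', '.', ' ' ) },
};

# Illustrative only: lower-case the text, split it on the separator
# characters, then drop words that are too short or are stop words.
sub extract_words_sketch
{
	my( $indexing, $text ) = @_;

	# Build a regex character class from the configured separators.
	my $seps = join( '', map { quotemeta( $_ ) }
		keys %{ $indexing->{freetext_seperator_chars} } );

	my @words;
	foreach my $word ( split /[${seps}]+/, lc( $text ) )
	{
		next if length( $word ) < $indexing->{freetext_min_word_size};
		next if $indexing->{freetext_stop_words}->{$word};
		push @words, $word;
	}

	return @words;
}

my @words = extract_words_sketch( $indexing,
	"The open-access repository: you and EPrints" );
print join( ",", @words ), "\n";    # prints "open,access,repository,eprints"
```

A repository with special requirements (for example, identifiers that contain hyphens and should stay intact) would adjust the separator set or add its own handling inside the function, which is the kind of bespoke behaviour a user-defined function allows.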