A few elements in the interface are specific and need an explanation.
A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, it is restricted in length (to 200 bytes), which is why a regular URI cannot be used. The structure and contents of the udi are defined by the application and opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to length, with the suppressed part replaced by a hash value.
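The truncate-and-hash idea can be sketched as follows. This is only an illustration under stated assumptions: the make_udi name, the "|" separator between file path and internal path, and the use of an MD5 hex digest are hypothetical choices, not the format used by the Recoll file system indexer. Only the 200 byte limit comes from the text.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a udi (as bytes) from a file path and an internal path.

    Hypothetical scheme: the "|" separator and the MD5 suffix are
    illustrative assumptions, not Recoll's actual udi format.
    """
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    # An MD5 hex digest is always 32 ASCII characters.
    keep = max_bytes - 32
    # Replace the suppressed tail by a hash of that tail, so that two long
    # paths sharing the same head still yield different identifiers.
    digest = hashlib.md5(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest


print(make_udi("/home/me/doc.txt", "1"))
print(len(make_udi("/" + "a" * 300, "1")))
```

Because the udi is opaque to the index engine, cutting the byte string in the middle of a multi-byte character is harmless in this sketch; the value is only ever compared, never decoded.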
The ipath data value (set as a field in the Doc
object) is stored, along with the URL, but not indexed by
Recoll. Its contents are not interpreted, and its use is up
to the application. For example, the Recoll internal file
system indexer stores the part of the document access path
internal to the container file (ipath in
this case is a list of subdocument sequential numbers). url
and ipath are returned in every search result and permit
access to the original document.
The fields file inside
the Recoll configuration defines which document fields are
either "indexed" (searchable), "stored" (retrievable with
search results), or both.
Data for an external indexer should be stored in a separate index, not the one for the Recoll internal file system indexer (except if the latter is not used at all). The reason is that the main document indexer purge pass would remove all the other indexer's documents, as they were not seen during indexing. The main indexer documents would also probably be a problem for the external indexer purge operation.
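The purge problem can be illustrated with a toy model. The code below does not use Recoll at all: the index is a plain dictionary, and index_pass is a hypothetical stand-in for "index everything I can see, then purge what I did not see". It is only meant to show why two indexers sharing one index erase each other's documents.

```python
def index_pass(index, visible_docs):
    """Toy model of one indexing pass followed by a purge.

    index maps udi -> content; visible_docs holds the documents that the
    running indexer sees during this pass. Illustrative only.
    """
    seen = set()
    for udi, content in visible_docs.items():
        index[udi] = content
        seen.add(udi)
    # Purge: every entry not seen during this pass is treated as deleted.
    for udi in list(index):
        if udi not in seen:
            del index[udi]
    return index


# Shared index: the second pass removes the first indexer's document.
shared = {}
index_pass(shared, {"mail:1": "hello"})          # external indexer
index_pass(shared, {"/home/me/a.txt": "text"})   # file system indexer
print(sorted(shared))  # ['/home/me/a.txt']

# Separate indexes: each purge only touches its own documents.
external, main = {}, {}
index_pass(external, {"mail:1": "hello"})
index_pass(main, {"/home/me/a.txt": "text"})
print(sorted(external), sorted(main))
```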
Recoll versions after 1.11 define a Python programming interface, both for searching and indexing. The indexing portion has seen little use, but the searching one is used in the Recoll Ubuntu Unity Lens and Recoll Web UI.
The API is inspired by the Python database API specification. There were two major changes in recent Recoll versions:
The recoll module became a
package (with an internal recoll
module) as of Recoll version 1.19, in order to add more
functions. For existing code, this only changes the way
the interface must be imported.
We will mostly describe the new API and package structure here. A paragraph at the end of this section will explain a few differences and ways to write code compatible with both versions.
The Python interface can be found in the source package,
under python/recoll.
The python/recoll/ directory
contains the usual setup.py. After
configuring the main Recoll code, you can use the script to
build and install the Python module:
cd recoll-xxx/python/recoll
python setup.py build
python setup.py install
The normal Recoll installer installs the Python API along with the main code.
When installing from a repository, and depending on the distribution, the Python API can sometimes be found in a separate package.
The recoll package contains two
modules:
The recoll module contains
functions and classes used to query (or update) the
index.
The rclextract module contains
functions and classes used to access document
data.
The connect() function connects to
one or several Recoll index(es) and returns
a Db object.
confdir may specify
a configuration directory. The usual defaults
apply.
extra_dbs is a list of
additional indexes (Xapian directories).
writable decides if
we can index new data through this
connection.
A Db object is created by
a connect() call and holds a
connection to a Recoll index.
Methods
Db object after
this.
Db.query(): returns a Query object
for this index.
Db.setAbstractParams(maxchars, contextwords): maxchars defines the
maximum total size of the abstract.
contextwords defines how many
terms are shown around the keyword.
match_type
can be either
wildcard, regexp
or stem. Returns a list of terms
expanded from the input expression.
A Query object (equivalent to a
cursor in the Python DB API) is created by
a Db.query() call. It is used to
execute index searches.
Methods
fieldname, in ascending
or descending order. Must be called before executing
the search.
Query.execute(query_string): starts a search for query_string, a Recoll
search language string.
Doc objects in the current
search results, and returns them as an array of the
required size, which is by default the value of
the arraysize data member.
Query.fetchone(): fetches the next Doc object
from the current search results.
Query.scroll(): mode can
be relative
or absolute.
Query.highlight(): ishtml
can be set to indicate that the input text is HTML and
that HTML special characters should not be escaped.
methods, if set, should be an object
with methods startMatch(i) and endMatch(), which will be
called for each match and should return a begin and end
tag.
doc (a Doc
object) by selecting text around the match terms.
If methods is set, will also perform highlighting. See
the highlight method.
for doc in
query: will work.
Data descriptors
scroll()). Starts at
0.
A Doc object contains index data
for a given document. The data is extracted from the
index when searching, or set by the indexer program when
updating. The Doc object has many attributes to be read or
set by its user. It matches exactly the Rcl::Doc C++
object. Some of the attributes are predefined, but,
especially when indexing, others can be set, the name of
which will be processed as field names by the indexing
configuration. Inputs can be specified as Unicode or
strings. Outputs are Unicode objects. All dates are
specified as Unix timestamps, printed as strings. Please
refer to the rcldb/rcldoc.h C++ file
for a description of the predefined attributes.
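Since dates come back as Unix timestamps printed as strings, a small conversion step is usually needed before displaying them. The sketch below assumes nothing about which Doc attribute holds the date; parse_recoll_date is a hypothetical helper name and the sample value is made up.

```python
from datetime import datetime, timezone


def parse_recoll_date(value):
    """Convert a Unix timestamp given as a string (or int) to a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


stamp = "1388534400"  # made-up sample value, as it might appear in a Doc field
print(parse_recoll_date(stamp).isoformat())  # 2014-01-01T00:00:00+00:00
```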
At query time, only the fields that are defined
as stored, either by default or in
the fields configuration file, will be
meaningful in the Doc
object. In particular, this is not the case for the
document text. See the rclextract
module for accessing document contents.
Methods
A SearchData object allows building
a query by combining clauses, for execution
by Query.executesd(). It can be used
instead of the query language approach. The
interface is going to change a little, so it is not
documented in detail for now.
Methods
Index queries do not provide document content (only a
partial and imprecise reconstruction is performed to show the
snippets text). In order to access the actual document data,
the data extraction part of the indexing process
must be performed (subdocument access and format
translation). This is not trivial in
general. The rclextract module currently
provides a single class which can be used to access the data
content for result documents.
Methods
An Extractor object is
built from a Doc object, output
from a query.
Extractor.textextract(ipath): extracts the document defined by ipath and returns
a Doc object. The doc.text field
has the document text converted to either text/plain or
text/html according to doc.mimetype. The typical use
would be as follows:
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
The following sample would query the index with a user
language string. See the python/samples
directory inside the Recoll source for other
examples. The recollgui subdirectory
has a very embryonic GUI which demonstrates the
highlighting and data extraction functions.
#!/usr/bin/env python
from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
    print
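The methods argument mentioned for the highlight method is described as an object with startMatch(i) and endMatch() methods, each returning a tag. A minimal sketch of such an object follows. What Recoll actually passes as i is not specified here, so the class treats it as an opaque integer. The apply_tags helper is purely hypothetical: it only imitates, on hand-made match positions, how such an object could be called, and is not Recoll's implementation.

```python
class BoldTags(object):
    """Candidate `methods` object: returns one begin and one end tag per match."""

    def startMatch(self, idx):
        # idx corresponds to the i argument; treated as an opaque integer.
        return '<b class="match%d">' % idx

    def endMatch(self):
        return "</b>"


def apply_tags(text, spans, methods):
    """Illustration only: wrap each (start, end, idx) span using `methods`."""
    out = []
    pos = 0
    for start, end, idx in sorted(spans):
        out.append(text[pos:start])
        out.append(methods.startMatch(idx))
        out.append(text[start:end])
        out.append(methods.endMatch())
        pos = end
    out.append(text[pos:])
    return "".join(out)


print(apply_tags("find the needle here", [(9, 15, 0)], BoldTags()))
# find the <b class="match0">needle</b> here
```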
The following code fragments can be used to ensure that code can run with both the old and the new API (as long as it does not use the new abilities of the new API of course).
Adapting to the new package structure:
try:
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    import recoll
    hasextract = False
Adapting to the change of nature of
the next Query
member. The same test can be used to choose to use
the scroll() method (new) or set
the next value (old).
rownum = query.next if type(query.next) == int else \
    query.rownumber
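The same test can be wrapped in two small helpers, one to read the position and one to set it. This is a sketch under assumptions: current_rownumber and seek are hypothetical names, the scroll(value, mode) call follows the Python DB API convention and is assumed rather than confirmed for Recoll, and the two stub classes only mimic the relevant members so that the helpers can be tried without an index.

```python
def current_rownumber(query):
    """Return the index of the next result for either API version."""
    nxt = getattr(query, "next", None)
    if isinstance(nxt, int):
        return nxt           # old API: next is an integer member
    return query.rownumber   # new API: next is no longer an integer


def seek(query, position):
    """Move to an absolute position in the results for either API version."""
    nxt = getattr(query, "next", None)
    if isinstance(nxt, int):
        query.next = position
    else:
        # Assumed signature, modelled on the DB API cursor.scroll().
        query.scroll(position, mode="absolute")


class OldQueryStub(object):
    """Stand-in for an old-style Query (illustration only)."""
    def __init__(self):
        self.next = 0


class NewQueryStub(object):
    """Stand-in for a new-style Query (illustration only)."""
    def __init__(self):
        self.rownumber = 0

    def next(self):
        raise StopIteration

    def scroll(self, value, mode="relative"):
        if mode == "absolute":
            self.rownumber = value
        else:
            self.rownumber += value


for stub in (OldQueryStub(), NewQueryStub()):
    seek(stub, 3)
    print(current_rownumber(stub))
```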