Incremental Alfresco connector for Lucidworks Fusion
As an Enterprise Content Management solution managing content from various sources, of various formats, and in various languages, Alfresco aims to structure, protect and make data available to end-users and to other applications.
Search and related functionality such as facets, suggestions, and highlights are important pieces in realizing these goals, for both structured (metadata/properties) and unstructured (content) data.
Lucidworks Fusion provides a search platform for verticals such as eCommerce, customer support, and enterprise search. It aims not only to slice and dice the data but also to enrich it via AI techniques in a flexible and scalable way. The two products naturally complement each other, for an improved search experience.
Existing Lucidworks Fusion connector
Alfresco has already been integrated with Fusion via a connector that adheres to the CMIS (Content Management Interoperability Services) standard.
This Lucidworks Fusion connector is a crawler that traverses the tree of documents from Alfresco and indexes them. Subsequent passes look at the modification date of documents and only continue traversing their subtrees if this date is newer than the date of the last crawl.
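The incremental pass of such a crawler can be pictured with a minimal sketch. The `Node` class, the `crawl` function and the sample tree are hypothetical and only illustrate the pruning rule; they are not taken from the CMIS connector itself.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Node:
    """Hypothetical stand-in for a folder or document in the repository tree."""
    name: str
    modified: datetime
    children: list = field(default_factory=list)


def crawl(node, last_crawl, visited):
    """Handle a node and descend into its subtree only if it changed since the last crawl."""
    if node.modified <= last_crawl:
        return  # unchanged since the last crawl: skip the node and its subtree
    visited.append(node.name)  # a real crawler would (re)index the node here
    for child in node.children:
        crawl(child, last_crawl, visited)


tree = Node("root", datetime(2024, 3, 1), [
    Node("archive", datetime(2024, 1, 1), [Node("old.pdf", datetime(2024, 1, 1))]),
    Node("inbox", datetime(2024, 3, 1), [Node("new.pdf", datetime(2024, 3, 1))]),
])
visited = []
crawl(tree, datetime(2024, 2, 1), visited)
print(visited)
```

Even in this small example every node on the path to a change has to be checked, which is the cost that grows with the size of the repository.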
Authorities that have at least read permissions are indexed together with documents. A change in permissions on the Alfresco side triggers a reindexing of the subtree at the next crawl cycle.
At query time, the connector first sends a call to Alfresco to get the groups the current user belongs to. Afterwards, the security filter appends a terms query to the main query.
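As an illustration only, a filter of this kind could be built with Solr's terms query parser. The field name `acl_ss` and the authority names below are hypothetical, not the connector's actual schema.

```python
def security_filter(authorities, field="acl_ss"):
    """Build a Solr terms-query filter matching any of the given authorities.

    `field` is a hypothetical field holding the authorities allowed to read a document.
    """
    if not authorities:
        raise ValueError("at least one authority is required")
    # The terms query parser takes a comma-separated list of values by default.
    return "{!terms f=" + field + "}" + ",".join(authorities)


print(security_filter(["jdoe", "GROUP_EVERYONE", "GROUP_sales"]))
```

The resulting string would typically be passed as a filter query (`fq`) next to the user's main query.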
Incremental Lucidworks Fusion connector
The main problem with the existing Lucidworks Fusion connector is its non-incremental nature, or rather the way incrementality is implemented. For a big repository, a crawl cycle can take a very long time and involves unnecessary operations/checks on many nodes. As a consequence, it takes some time for new data added to Alfresco to reach the index, which lengthens the window of what in Alfresco’s Solr is called “eventual consistency”.
A second problem is the indexing of permissions together with documents, when in fact they are modeled as 2 separate hierarchies in Alfresco. More details about this are in the following sections.
Also, in Alfresco it’s possible to deny a specific authority access to documents, a use case that the existing connector does not cover.
In order to address these problems, a new connector was built, operating in a similar fashion to Alfresco’s Solr tracker. A short overview of the way Solr tracks Alfresco can be read in our article Search Indexing Alfresco.
The connector is available on Xenit’s GitHub, for the moment as a private project: https://github.com/xenit-eu/alfresco-fusion-connector.
In practice, there are 2 connectors running in parallel: one for metadata and content and a second one for permissions. They add data in 2 separate collections in Fusion.
Tracking metadata and content
Although this post is not meant to go into all details of the connector’s implementation, for a better understanding it is necessary to sketch the high-level algorithm. In a single-thread scenario, one fetcher cycle looks as follows:
- Loop transactionBatchesToFetch times
  - ask Alfresco for the last transaction from the database
  - ask Alfresco for a batch of transactions to be processed and split them into update and delete transactions
    - the first and last transactions in the batch are computed based on the result of the previous query, relevant input parameters, and/or the previous fetch cycle
  - for update transactions:
    - retrieve the list of nodes involved from Alfresco
    - for each batch of nodes (nodeBatchSize)
      - retrieve metadata and (possibly) content from Alfresco
      - index the nodes
  - for delete transactions:
    - retrieve the list of nodes involved from Alfresco
    - for each batch of nodes (nodeBatchSize)
      - delete the nodes from the index
This algorithm works for both a catch-up and a live-tracking scenario, via a scheduler provided by Fusion.
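The fetcher cycle above can be sketched in a few lines of Python. This is a simplified illustration, not the connector's code: `FakeAlfresco` is an in-memory stand-in for the REST client, all its method names are hypothetical, the index is a plain dictionary, and the extra `transaction_batch_size` parameter is an assumption.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class FakeAlfresco:
    """In-memory stand-in for the REST client; every method name is hypothetical."""

    def __init__(self, txns):
        self._txns = txns  # txn id -> {"updated": [node ids], "deleted": [node ids]}

    def last_transaction_id(self):
        return max(self._txns)

    def transactions(self, first, last):
        return [
            {"id": t, "updates": len(c["updated"]), "deletes": len(c["deleted"])}
            for t, c in sorted(self._txns.items()) if first <= t <= last
        ]

    def nodes(self, txn_ids, status):
        return [n for t in txn_ids for n in self._txns[t][status]]

    def metadata(self, node_ids):
        return {n: {"id": n} for n in node_ids}


def fetch_cycle(client, index, state, transaction_batches_to_fetch,
                transaction_batch_size, node_batch_size):
    """Run one single-threaded fetcher cycle against `client`, updating `index`."""
    for _ in range(transaction_batches_to_fetch):
        last_txn = client.last_transaction_id()
        first = state["last_txn"] + 1  # resume after the previous fetch cycle
        if first > last_txn:
            break  # nothing new to track
        last = min(first + transaction_batch_size - 1, last_txn)
        txns = client.transactions(first, last)
        update_txns = [t["id"] for t in txns if t["updates"]]
        delete_txns = [t["id"] for t in txns if t["deletes"]]
        for chunk in batches(client.nodes(update_txns, "updated"), node_batch_size):
            index.update(client.metadata(chunk))  # index the nodes
        for chunk in batches(client.nodes(delete_txns, "deleted"), node_batch_size):
            for node_id in chunk:
                index.pop(node_id, None)  # delete the nodes from the index
        state["last_txn"] = last
```

Because the cycle always resumes from the last processed transaction kept in `state`, the same function serves catch-up (many pending transactions) and live tracking (few or none).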
Communication with Alfresco is done using a custom REST client built by Xenit: https://github.com/xenit-eu/alfresco-remote-api-clients.
This client was developed specifically to be included in Spring applications and causes some log pollution when used in Fusion. In future updates, it will be replaced by a simpler HTTP client.
More details about the parameters used can be found in the README of the component: https://github.com/xenit-eu/alfresco-fusion-connector/tree/master/alfresco-data.
In order to speed up the tracking process, a fetcher can run on multiple threads. This functionality is included in the connector SDK provided by Fusion. The job of the connector is to define candidates to be executed in parallel.
There are 3 index strategies implemented (see diagram).
- index metadata only
- index metadata and content in 1 step (MC_1S)
- index metadata and content in 2 steps (MC_2S)
In strategy 2 (MC_1S), metadata and content are indexed synchronously, in one step. Search results take longer to become available but, if full-text search is required, this approach is to be preferred. It has the advantage of simplicity, and each document is indexed only once.
In strategy 3 (MC_2S), metadata is indexed first, then the content, asynchronously. In this scenario, when a document is handled, it is both indexed and kept as a transient candidate for further processing. When the candidate has been enriched with content, a new document is emitted for indexing, overwriting the previous one. The advantage of this approach is that search results become available sooner in a search interface, initially with metadata only.
The candidates to be picked up by the fetcher threads are:
- transaction ranges: a batch of transactions
- nodes: a node with metadata, to be enriched with content in strategy MC_2S
Both types of candidates are transient, i.e. once handled they are removed from the local candidates DB.
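To make the two-step strategy more concrete, here is a minimal single-threaded sketch of how the two candidate types could interact. The queue, the tuple layout and the `fetch_metadata`/`fetch_content` callables are assumptions made for illustration; the actual connector relies on the candidate handling of the Fusion connector SDK.

```python
from collections import deque


def run_mc_2s(node_ids, fetch_metadata, fetch_content):
    """Simulate MC_2S: index metadata first, then re-emit each document with content."""
    index = {}
    emitted = []  # order in which documents are sent to the index
    candidates = deque([("transactions", list(node_ids))])
    while candidates:
        kind, payload = candidates.popleft()  # a handled candidate is removed
        if kind == "transactions":
            for node_id in payload:
                doc = fetch_metadata(node_id)
                index[node_id] = doc  # step 1: metadata-only document
                emitted.append((node_id, "metadata"))
                candidates.append(("node", doc))  # keep as transient candidate
        else:
            doc = dict(payload, content=fetch_content(payload["id"]))
            index[doc["id"]] = doc  # step 2: overwrites the metadata-only document
            emitted.append((doc["id"], "metadata+content"))
    return index, emitted


index, emitted = run_mc_2s(
    [1, 2],
    fetch_metadata=lambda n: {"id": n, "name": f"doc-{n}"},
    fetch_content=lambda n: f"text of {n}",
)
print(emitted)
```

In this sketch both documents are searchable by metadata before either has its content, which is the trade-off MC_2S makes against indexing every document twice.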
Access control lists (acls) are lists of (authority, permission) pairs specifying who has access to a document and what kind of access that is. For Solr, at least in the current implementation from Alfresco, only the read permission is of interest. Authorities allowed to read a document are called readers. Authorities that are explicitly forbidden to read a document are called denies. Each time permissions are changed, a number of acls are involved in a changeset (= acl transaction).
The high-level algorithm for one acl fetcher cycle is as follows:
- ask Alfresco for a batch of changesets to be processed
  - the first changeset in the batch, or the first commit time of the changeset, is computed based on relevant input parameters and/or the previous fetch cycle
- for each changeset
  - ask Alfresco for the list of acls involved
  - for each acl
    - ask Alfresco for the list of readers and denies
    - index the acl together with its readers and its denied authorities
This algorithm works for both a catch-up and a live-tracking scenario. Indexing the acls is much faster than indexing documents, since the acl hierarchy is smaller and acls have no content and only a few metadata fields.
More details about the parameters used can be found in the README of the component: https://github.com/xenit-eu/alfresco-fusion-connector/tree/master/alfresco-acls.
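A compact sketch of one acl fetcher cycle could look as follows. `FakeAclApi` is an in-memory stand-in with hypothetical method names, and the acl collection is represented by a dictionary; the real connector talks to Alfresco's endpoints and to a Fusion collection.

```python
class FakeAclApi:
    """In-memory stand-in for the Alfresco client; every method name is hypothetical."""

    def __init__(self, changesets, access):
        self._changesets = changesets  # changeset id -> [acl ids]
        self._access = access          # acl id -> (readers, denied)

    def changesets(self, first, max_results):
        ids = [c for c in sorted(self._changesets) if c >= first]
        return ids[:max_results]

    def acls(self, changeset_id):
        return self._changesets[changeset_id]

    def readers(self, acl_id):
        return self._access[acl_id]


def acl_fetch_cycle(client, acl_index, state, changeset_batch_size):
    """Index every acl touched by the next batch of changesets."""
    first = state["last_changeset"] + 1  # resume after the previous fetch cycle
    for changeset_id in client.changesets(first, changeset_batch_size):
        for acl_id in client.acls(changeset_id):
            readers, denied = client.readers(acl_id)
            # one flat document per acl: id + readers + denied authorities
            acl_index[acl_id] = {"id": acl_id, "readers": readers, "denied": denied}
        state["last_changeset"] = changeset_id
```

Each acl ends up as one small, flat document, which is why this collection is cheap to rebuild compared with the document collection.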
Advantages of tracking ACLs separately from documents
Decoupling tracking of documents from the tracking of permissions has a number of advantages:
- in general, the number of permissions is small compared to the number of documents. It makes sense to scale up by sharding the document collection and having a copy of the acls collection on each shard
- cascading permission changes to the subtree whose root was changed does not entail reindexing documents; only the acl hierarchy changes
- the implementation is in line with the implementation of other Fusion connectors (e.g. the SharePoint connector)
- the implementation is in line with the way permissions are modelled in Alfresco’s database
Querying: permission filtering
Data from the 2 connectors is linked via the AlfrescoSecurityFilterComponent. The role of this component is to keep only documents that are readable by the current user.
In other Fusion connectors, the security filter executes a graph+join query to flatten the acl collection and then to join it with the main collection. In the case of the incremental Alfresco connector, the graph query is not needed, since the list of readers and denies is already flattened in the acl collection, provided as such by Alfresco’s endpoints related to Solr.
Steps in the security filter:
- Get all groups the current user belongs to: <group1> … <groupn>
- Create a filter query to
  - Get documents that have ROLE_OWNER and are owned by the <crt_user>
  - Get documents with acls whose readers contain the user itself or one of <group1> … <groupn>
  - Filter out documents with acls that deny access to the user itself or to one of <group1> … <groupn>
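The logic of these steps can be illustrated in plain Python over in-memory data. This is only a sketch of the semantics: the real component expresses them as a Solr filter query that joins the two collections, and the field names `acl_id`, `owner`, `readers` and `denied` are hypothetical. The sketch also assumes that a deny takes precedence over both the owner and the reader rule.

```python
def readable_documents(documents, acls, user, groups):
    """Return the ids of the documents the user may read."""
    authorities = {user, *groups}
    result = []
    for doc in documents:
        acl = acls[doc["acl_id"]]
        readers, denied = set(acl["readers"]), set(acl["denied"])
        is_owner = "ROLE_OWNER" in readers and doc["owner"] == user
        is_reader = bool(readers & authorities)
        is_denied = bool(denied & authorities)  # assumption: deny always wins
        if (is_owner or is_reader) and not is_denied:
            result.append(doc["id"])
    return result


acls = {
    1: {"readers": ["GROUP_A"], "denied": []},
    2: {"readers": ["GROUP_A"], "denied": ["alice"]},
    3: {"readers": ["ROLE_OWNER"], "denied": []},
    4: {"readers": ["GROUP_B"], "denied": []},
}
documents = [
    {"id": "d1", "acl_id": 1, "owner": "bob"},
    {"id": "d2", "acl_id": 2, "owner": "bob"},
    {"id": "d3", "acl_id": 3, "owner": "alice"},
    {"id": "d4", "acl_id": 3, "owner": "bob"},
    {"id": "d5", "acl_id": 4, "owner": "bob"},
]
print(readable_documents(documents, acls, "alice", ["GROUP_A"]))
```

Here `d2` is excluded by an explicit deny, `d4` because the owner rule applies to another user, and `d5` because none of the user's authorities is a reader.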
The incremental Fusion connector follows the same tracking mechanism as the Solr used by Alfresco. Why would anyone use this connector instead of the indexing server from Alfresco? There are a few reasons:
- Newer Solr version in Fusion (and faster upgrades)
- Freedom to use advanced features of Solr: join queries, local parameters
- Freedom to add new fields to the index
- Scaling out-of-the-box via ZooKeeper
- Fusion is nicely integrated with App Studio Enterprise, which makes it possible to quickly build a search application
- Opening up possibilities for enterprise search across multiple sources
- Opening up possibilities for machine learning-based exploration and enrichment of data
The combination of Alfresco and Fusion can provide a powerful search experience with limited effort. The new incremental Lucidworks Fusion connector addresses problems that are not (easily) solved with the existing CMIS-based connector.