Alfresco + XML + Lucene



Alfresco is a great product, but it certainly loses many of its advantages when it indexes XML as plain text only.

We need to store XML documents and search for all documents in which the search terms appear inside a given XML node. We don't need full XPath, just the parent node. For example, search for toto@footer, where footer is an XML node of our document.

The complete XML structure is not known in advance and will certainly evolve with time.

We think this could easily be done at indexing time by dynamically adding a Lucene field for each XML tag encountered, as Alfresco does for metadata indexing.
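To make the idea concrete, here is a minimal sketch using only the JDK's XML parser. It collects, for each tag, the text found directly inside elements with that tag, which is the "parent node only" behaviour described above. The class and method names (XmlFieldSketch, fieldsByTag) are made up for illustration and are not Alfresco or Lucene APIs. Turning each map entry into a Lucene field is deliberately left out, since that part depends on the Alfresco indexer.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Alfresco or Lucene.
public class XmlFieldSketch {

    /** Returns tag name -> text found directly inside elements with that tag. */
    public static Map<String, String> fieldsByTag(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> fields = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void collect(Node element, Map<String, String> fields) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                // Text belongs to its immediate parent tag only.
                text.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                collect(child, fields);
            }
        }
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // Repeated tags are merged into one space-separated value.
            fields.merge(element.getNodeName(), value, (a, b) -> a + " " + b);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><body>hello world</body><footer>toto</footer></doc>";
        fieldsByTag(xml).forEach((tag, text) -> System.out.println(tag + " = " + text));
    }
}
```

With a map like this, a query such as toto@footer would amount to a term search for toto in the field named footer, assuming the indexer writes one field per map entry.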

The question: what is the nicest way to implement this? Which Alfresco classes should be overridden? At the moment it is quite difficult to understand how Alfresco interacts with Lucene.

Thanks for any advice!

P.S. The problem is quite similar to http://forums.alfresco.com/viewtopic.php?t=277, but simpler in our case. Is Alfresco going to do something about indexing XML documents?

I would like to do something very similar. Does anyone know the plans?

Same problem!


This is something we plan to address in the future. At the moment, content, including XML, is just converted into text and indexed. You could write your own action to extract metadata and populate some predefined properties, just as is done for Word documents and PDFs.

Properties need to be defined in the model. There is no support for arbitrary properties derived from whatever tags you may find in the XML.

At the moment, you can extract information from elements in your XML docs into a defined property and then use that property for search. An action would be the best way to do this.

We do not support internal queries into XML documents. This does sound possible - but not using the current search API.

There is no reason why you cannot add additional fields to the Lucene index when you find an XML doc. You would have to alter LuceneIndexerImpl to do this when it encounters an XML type. You may also need to add support for determining the type of each field in LuceneAnalyser.



Andy Hind
Alfresco Development

Perhaps it would be useful to look at the Compass project. Compass addresses this kind of problem by means of an Object to Search Engine Mapping or an XML to Search Engine Mapping. The bottom line is quite simple: a Hibernate-like mapping is defined that maps object properties or XML fields to Lucene fields. In other words, an object or XML file with a certain layout is mapped to a Lucene document, which is then inserted into the index. Compass is open source, so you can easily take a deeper look at how this is achieved...

kind regards


There is now support for XPath extraction from XML documents into metadata. You can then search the metadata rather than the whole document as transformed into text.
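As a rough illustration of the idea (not the actual Alfresco extractor or its configuration), the sketch below uses the JDK's javax.xml.xpath API to pull values out of an XML document into a map of property name to value. The class name XPathMetadataSketch and the property name my:footerText are hypothetical; in Alfresco the target properties would have to be defined in the model.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not an Alfresco class.
public class XPathMetadataSketch {

    /**
     * Evaluates one XPath expression per property and returns
     * property name -> extracted string value (empty if nothing matched).
     */
    public static Map<String, String> extract(String xml, Map<String, String> xpathByProperty) {
        try {
            // Parse once, then evaluate every expression against the same DOM.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : xpathByProperty.entrySet()) {
                String value = xpath.evaluate(entry.getValue(), doc);
                metadata.put(entry.getKey(), value.trim());
            }
            return metadata;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract metadata", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("my:footerText", "/doc/footer"); // hypothetical property name
        String xml = "<doc><body>hello world</body><footer>toto</footer></doc>";
        System.out.println(extract(xml, mapping));
    }
}
```

Searching on the extracted property then finds only documents whose footer contains the term, rather than any document whose full text does.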


Andy Hind
Alfresco Development
