Compass as a schema free database
Whilst I’ve been playing around with persistence over the last couple of days, I read a page about swapping out a database for Lucene (via Compass).
My Compass (vs Hibernate Search) results from yesterday also had me wondering why I was using a database at all. When using an ORM you are often more interested in the easy storage of objects than in the properties of the database behind it. Specifically, you often aren’t using all the SQL muscle.
Depending on the type of data you are storing (and the database), an ORM might not have any advantages, for example for large amounts of unstructured text or for objects that don’t have a defined schema.
Today I had a quick play with Compass as a schema free database. Using the resource mapping you can simply create an object (a resource) and add properties (i.e. name/value pairs) to it. These resources are indexed by Compass and available through searching. Resource mapping allows you to configure how resources are pushed into Lucene (for example whether they are stored or indexed, which analyser is used and how objects are converted to strings).
Since I haven’t posted any code in a long time, I’ll post the important pieces here:
META-INF/compass.cfg.xml:
<?xml version="1.0" ?>
<compass-core-config xmlns="http://www.compass-project.org/schema/core-config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.compass-project.org/schema/core-config http://www.compass-project.org/schema/compass-core-config-2.0.xsd">
  <compass name="default">
    <connection>
      <file path="compass_index"/>
    </connection>
    <mappings>
      <resource location="META-INF/entity.cpm.xml" />
    </mappings>
  </compass>
</compass-core-config>
META-INF/entity.cpm.xml:
<?xml version="1.0" ?>
<!DOCTYPE compass-core-mapping PUBLIC
    "-//Compass/Compass Core Mapping DTD 2.0//EN"
    "http://www.compass-project.org/dtd/compass-core-mapping-2.0.dtd">
<compass-core-mapping>
  <resource alias="entity">
    <resource-id name="id" />
    <resource-property name="contents" store="no" index="tokenized" />
  </resource>
</compass-core-mapping>
Main.java:
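import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.compass.core.Compass;
import org.compass.core.CompassHit;
import org.compass.core.CompassHits;
import org.compass.core.CompassSession;
import org.compass.core.CompassTransaction;
import org.compass.core.Property.Index;
import org.compass.core.Property.Store;
import org.compass.core.Resource;
import org.compass.core.ResourceFactory;
import org.compass.core.config.CompassConfiguration;
import org.compass.core.config.CompassConfigurationFactory;

public class Main {
    private static final Logger logger = Logger.getLogger(Main.class.getName());

    public static void main(String[] args) {
        CompassConfiguration cfg = CompassConfigurationFactory.newConfiguration();
        Compass compass = cfg.configure("/META-INF/compass.cfg.xml").buildCompass();
        // Release locks left behind by previously crashed instances (useful when debugging trouble!)
        compass.getSearchEngineIndexManager().releaseLocks();
        CompassSession session = compass.openSession();
        ResourceFactory resourceFactory = compass.getResourceFactory();
        int count = 0;
        String sourcePath = "/path/to/files";
        // OPTIONAL: Turn the optimiser off here for increased speed
        // compass.getSearchEngineOptimizer().stop();
        // Use the LUCENE transaction isolation for faster batch processing
        CompassTransaction tr = session.beginTransaction(CompassTransaction.TransactionIsolation.LUCENE);
        // listFiles() returns null if the path is missing or not a directory
        File[] files = new File(sourcePath).listFiles();
        if (files == null) {
            files = new File[0];
        }
        for (File file : files) {
            try {
                Resource r = resourceFactory.createResource("entity");
                r.addProperty("id", UUID.randomUUID().toString());
                // Just for fun add a property we haven't mapped at compile time
                r.addProperty(resourceFactory.createProperty("filename", file.getAbsolutePath(), Store.YES, Index.UN_TOKENIZED));
                r.addProperty("contents", new FileReader(file));
                session.save(r);
                // Simple batching to improve speed: commit after every 1000 saved resources
                if (++count % 1000 == 0) {
                    tr.commit();
                    tr = session.beginTransaction(CompassTransaction.TransactionIsolation.LUCENE);
                }
            } catch (IOException ex) {
                logger.log(Level.SEVERE, null, ex);
                // Note: this discards the whole uncommitted batch, so start a fresh transaction
                tr.rollback();
                tr = session.beginTransaction(CompassTransaction.TransactionIsolation.LUCENE);
            }
        }
        // Commit the final (partial) batch before searching
        tr.commit();
        // OPTIONAL: Turn the optimiser back on and optimise (if you turned it off)
        // compass.getSearchEngineOptimizer().start();
        // compass.getSearchEngineOptimizer().optimize();
        // Now search
        tr = session.beginTransaction();
        CompassHits hits = session.find("Your search query here");
        for (CompassHit hit : hits) {
            Resource r = hit.getResource();
            System.out.println(r.getValue("filename"));
        }
        tr.commit();
        session.close();
        compass.close();
    }
}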
I chose resources over POJO mapping because they’re ideal for letting users store metadata about an entity. New properties can be added and queried at runtime (though some control might be advisable to help find things later on).
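To make the "properties added and queried at runtime" idea concrete without needing a Compass install, here is a rough sketch of the same concept using nothing but Java collections. This is a toy stand-in, not the Compass API: the class name SchemaFreeStore, its save and find methods, and the "reviewer" property are all made up for illustration, and it only does exact-match lookups rather than analysed full-text search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Toy in-memory schema-free store (hypothetical; NOT the Compass API). */
public class SchemaFreeStore {
    // id -> resource, where a resource is a free-form bag of name/value pairs
    private final Map<String, Map<String, String>> resources = new LinkedHashMap<>();
    // property name -> property value -> ids of resources carrying that pair
    private final Map<String, Map<String, List<String>>> index = new HashMap<>();

    /** Stores a resource under a fresh random id and indexes every property. */
    public String save(Map<String, String> properties) {
        String id = UUID.randomUUID().toString();
        resources.put(id, new HashMap<>(properties));
        for (Map.Entry<String, String> e : properties.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .computeIfAbsent(e.getValue(), v -> new ArrayList<>())
                 .add(id);
        }
        return id;
    }

    /** Returns every resource whose property `name` equals `value` exactly. */
    public List<Map<String, String>> find(String name, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        List<String> ids = index.getOrDefault(name, Map.of()).getOrDefault(value, List.of());
        for (String id : ids) {
            hits.add(resources.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        SchemaFreeStore store = new SchemaFreeStore();
        store.save(Map.of("filename", "/tmp/a.txt"));
        // "reviewer" was never declared anywhere; it simply appears at runtime
        store.save(Map.of("filename", "/tmp/b.txt", "reviewer", "alice"));
        for (Map<String, String> hit : store.find("reviewer", "alice")) {
            System.out.println(hit.get("filename"));
        }
    }
}
```

Nothing in the store knows about "reviewer" until a resource carrying it is saved, which is also where the caveat about control comes from: a property spelled two different ways would simply end up as two unrelated index entries.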
I’ve got a specific application in mind where I want to use this, but I think it’s got wider utility as the web moves us more towards textual (metadata) information. With a bit of work you could probably achieve a CouchDB in Lucene…