Thursday, April 16, 2009

Solr Indexing

Rather than go the DBMS route for my file index, I decided to go with Solr, from the Apache Foundation. I want to keep the index away from the data - for what I have in mind, they are separate things. It doesn't mean there won't be a regular database sitting around someplace, with its own indexes, it's just that I wanted this specific index to stand on its own.

Solr does everything that I would've written, anyway. It accepts and returns data over HTTP, as XML or JSON, and gives you replication, caching, and a bunch of other shiny and fun things. Since it runs as a Java servlet, there's all kinds of stuff you can stick in its way, and it's mega-configurable.
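The update side of that HTTP interface is about as simple as it sounds: documents get POSTed as an XML `<add>` envelope full of `<doc>` elements. A quick sketch of building that payload (the field names and the localhost URL are my own examples, not anything specific to my index):

```python
import xml.etree.ElementTree as ET

def solr_add_xml(docs):
    """Build the XML body for a Solr <add> request.

    `docs` is a list of dicts mapping field names to values.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", attrib={"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = solr_add_xml([{"id": "file-0001", "path": "/tmp/example.txt"}])
# POSTing `payload` to http://localhost:8983/solr/update (the default
# endpoint) is what actually queues the documents for indexing.
```

Queries come back the same way: hit the servlet with an HTTP GET and you get XML or JSON back, depending on the `wt` parameter.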

Retrieving data is pretty fast. Insertions, not so much. There isn't much data going back and forth, and I realize that insertions are, by their nature, slow. I suspect I could get better performance with fewer indexed fields, spread out among Solr instances (the servlet aspect makes this pretty easy).
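Spreading documents across instances could be as simple as hashing the document key to pick an instance, so the same key always lands in the same place. A sketch of the idea, with made-up instance URLs:

```python
import hashlib

# Hypothetical instance URLs - one Solr servlet per host.
SOLR_INSTANCES = [
    "http://solr1:8983/solr/update",
    "http://solr2:8983/solr/update",
    "http://solr3:8983/solr/update",
]

def pick_instance(doc_id: str) -> str:
    """Route a document to an instance by hashing its id.

    Hashing (rather than round-robin) keeps routing deterministic:
    re-adding or deleting a document hits the same instance that
    holds it.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return SOLR_INSTANCES[digest[0] % len(SOLR_INSTANCES)]
```

Queries would then fan out to all instances and merge results, which is the part that makes this less free than it looks.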

One of the slowdowns I've pinpointed has to do with the .NET WebClient class. It sucks. It uses some Windows-provided HTTP API, and it isn't happy with multiple threads. It also limits itself to two connections at once, then appears to deadlock - though that might be my code. The HTTP/1.1 specification recommends no more than two connections per server, and .NET honors that by default; I believe it can be overridden (ServicePointManager.DefaultConnectionLimit, or a setting in the Windows registry).

Not interested in that path, especially if it is going to be this slow.

At the moment, there's a commit after every add. This is another slowdown, but the library I'm using, SolrSharp, doesn't make doing it any other way very pretty. It shouldn't be too hard to fix - I've been dinking around in the code, and I see what changes I have to make. It's just a lot of refactoring. Bleh.
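The fix amounts to buffering adds and committing once per batch instead of once per document. This isn't SolrSharp's API - just the shape of what I'm after, sketched with a pluggable transport so the request count is visible:

```python
class BatchingIndexer:
    """Buffer document adds and flush them in one <add> request,
    followed by a single <commit/>, instead of committing per add."""

    def __init__(self, post, batch_size=100):
        self.post = post          # callable that sends a body to /update
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc_xml):
        self.pending.append(doc_xml)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.post("<add>" + "".join(self.pending) + "</add>")
        self.post("<commit/>")    # one commit per batch, not per doc
        self.pending = []

# With a fake transport, three adds produce two requests total
# (one batched <add>, one <commit/>) instead of six:
sent = []
indexer = BatchingIndexer(sent.append, batch_size=3)
for i in range(3):
    indexer.add(f"<doc><field name='id'>{i}</field></doc>")
```

The caller still has to remember a final `flush()` for a partial batch, which is the kind of wart the refactoring has to smooth over.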

So, on the todo list are:
  • a better HTTP client
  • modify the addition of records to allow batching
Obviously, I have to start with logging.
