Solr does everything that I would've written anyway. It accepts and returns data over HTTP, as XML or JSON. It has replication, caching, and a bunch of other shiny and fun things. Since it runs as a Java servlet, there's all kinds of stuff you can stick in front of it, and it's mega-configurable.
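To make the "data over HTTP" part concrete, here is a minimal sketch of building a query URL. It is in Python rather than .NET, just to keep it short. The base URL is an assumption (a local instance on the default port), and `build_select_url` is a made-up helper name; `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for your setup.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, rows=10, fmt="json", base=SOLR_BASE):
    """Return a Solr /select URL for `query`, asking for `fmt` output."""
    # wt selects the response writer, e.g. "json" or "xml".
    params = {"q": query, "rows": rows, "wt": fmt}
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("title:solr"))
```

Fetching that URL with any HTTP client should return the matching documents in the requested format.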
Retrieving data is pretty fast. Insertions, not so much. There isn't much data going back and forth, and I realize that insertions are, by their nature, slow. I suspect I could get better performance with fewer indexed elements per index, spread out among several Solr instances (the servlet aspect makes this pretty easy).
One of the slowdowns I've pinpointed has to do with the .NET WebClient class. It sucks. It uses some Windows-provided HTTP API, and it isn't happy with multiple threads. It also limits itself to two connections at once (and then appears to deadlock, but that might be my code). The HTTP/1.1 specification recommends that a client keep no more than two connections open to a server, and I think there's a way to override this in the Windows registry.
Not interested in that path, especially if it is going to be this slow.
At the moment, there's a commit that occurs after every add. This is another slowdown, but the library I'm using, SolrSharp, doesn't make doing it any other way very pretty. It shouldn't be too hard to fix. I've been dinking around in the code, and I see what changes I have to make. It's just a lot of refactoring. Bleh.
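For reference, here is a rough sketch of what batching could look like at the wire level: many documents per `<add>` message, with a single `<commit/>` at the very end. This is not SolrSharp code. It is Python, the helper names (`build_add_xml`, `batches`, `build_update_messages`) are made up for illustration, and the batch size of 100 is an arbitrary assumption. The `<add>`, `<doc>`, `<field>`, and `<commit/>` elements are Solr's XML update format.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Build one Solr <add> message containing every document in `docs`.

    Each document is a dict mapping field name to value.
    """
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape both the attribute and the text so the XML stays valid.
            parts.append(
                f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_update_messages(docs, batch_size=100):
    """Return the XML bodies to send: one <add> per batch, then one <commit/>."""
    messages = [build_add_xml(chunk) for chunk in batches(docs, batch_size)]
    messages.append("<commit/>")  # a single commit instead of one per add
    return messages
```

Each message would then be POSTed as XML to the instance's update handler. With three documents and `batch_size=2`, this produces two `<add>` messages and one `<commit/>`, instead of three adds and three commits.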
So, on the todo list are:
- a better HTTP client
- modify the addition of records to allow batching