Monday, August 31, 2009

Where's That File?

This post starts out with a reading of another blog, but it isn't just idle babble: it's about what I'm working on.

The author of this article claims "you have to think of content entirely abstractly". While there is some exposition as to what it should look like, it is very vague: "your system should be capable of managing any kind of content."

Fair enough, but how?

Well, that's what I've been working on. I think that the various types of data are best handled by programs specifically designed to handle that data. What we as users need is an easy way to find it.

The current solutions tend to involve centralization, synchronization, and search. You're supposed to keep all the important data centralized, if you need to organize it your own way then you synchronize it, and if you're looking for something you search for it.

Which is great, except that users don't do this, because it all sucks.

If I download a file from the internet, that file exists in two places I can get to: my download folder, and the original link. If I copy it up to a CMS, now it is in three places. If that CMS is backed up, it exists in four places. Copy it to a thumb drive? Now I'm up to five.

Five copies of the same file, in locations which are all equally valid, each with its own strengths and weaknesses. Between them, the data is unlikely to ever become completely irretrievable.

Now, as a user, all I want to know is "where's that file?" (thus the name of the project).

The author of the original article was correct in that the only important thing is the metadata. What he doesn't seem to get is that the metadata is the only content which needs to be managed.

Currently, the problem I'm solving is strictly a question of duplicate files on the network. I have files that I know must be backed up, but I don't know where all of those copies are. I don't want too many copies, because storage costs are on a rising curve: each additional terabyte costs more than the previous one.

Turns out, solving this problem isn't easy (my first naive implementations didn't scale), and a whole bunch of the work can be extended to other storage sources.
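
To give a flavor of the core problem: the obvious single-machine approach is to bucket files by size (cheap), then hash only the candidates that share a size. Here's a minimal sketch - the names, structure, and choice of SHA-1 are mine for illustration, not the project's actual code - and it's roughly where a naive implementation both starts and stops scaling:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Naive duplicate finder: bucket files by size, then hash only
// the files that share a size with something else. Fine for one
// disk; it stops scaling once the catalog outgrows memory or the
// files live on more than one machine.
public class DuplicateFinder {

    public static Map<String, List<File>> findDuplicates(List<File> files)
            throws IOException, NoSuchAlgorithmException {
        Map<Long, List<File>> bySize = new HashMap<Long, List<File>>();
        for (File f : files) {
            Long size = Long.valueOf(f.length());
            List<File> bucket = bySize.get(size);
            if (bucket == null) {
                bucket = new ArrayList<File>();
                bySize.put(size, bucket);
            }
            bucket.add(f);
        }

        Map<String, List<File>> byHash = new HashMap<String, List<File>>();
        for (List<File> bucket : bySize.values()) {
            if (bucket.size() < 2) {
                continue; // unique size, can't have a duplicate
            }
            for (File f : bucket) {
                String digest = sha1(f);
                List<File> dups = byHash.get(digest);
                if (dups == null) {
                    dups = new ArrayList<File>();
                    byHash.put(digest, dups);
                }
                dups.add(f);
            }
        }
        return byHash; // entries with more than one File are duplicates
    }

    private static String sha1(File f)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        InputStream in = new FileInputStream(f);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", Byte.valueOf(b)));
        }
        return hex.toString();
    }
}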

Having that, though, the next obvious step is to attach personal metadata (tags, descriptions) to the files. You have to collect and index metadata anyway (file name, size, etc.), so why not add user metadata, too?
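
As a sketch of what a single index entry might carry (the names here are hypothetical, not the project's):

import java.util.HashSet;
import java.util.Set;

// Hypothetical index entry: the system metadata you have to
// collect anyway, with the user's metadata riding alongside.
public class FileRecord {
    private final String name;     // e.g. "resume.doc"
    private final long size;       // in bytes
    private final String location; // which host/store holds this copy

    private final Set<String> tags = new HashSet<String>();
    private String description = "";

    public FileRecord(String name, long size, String location) {
        this.name = name;
        this.size = size;
        this.location = location;
    }

    public void addTag(String tag) { tags.add(tag); }
    public void setDescription(String d) { description = d; }

    public String getName() { return name; }
    public long getSize() { return size; }
    public String getLocation() { return location; }
    public Set<String> getTags() { return tags; }
    public String getDescription() { return description; }
}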

What I'd expect to see at that point is a UI which reflects the various metadata. If I'm looking for my resume, I should not only be able to find "resume.doc", I should see every copy of "resume.doc" the system knows about, even the ones I can't currently get to. I'd prefer that the "nearest" copy be highlighted in some way, things like that.

What I'd like to do after that (as if I didn't have enough to do) is assign rules to various tags. If I label something with "important", then it should be included in a special backup/sync/whatever. Again, this isn't something that will be particularly difficult, but it will require effort.
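
Sticking with the hypothetical FileRecord from the sketch above, a rule wouldn't need to be much more than a tag paired with an action:

// Hypothetical: a rule watches for one tag and acts on any file
// carrying it - feeding it to a backup job, a sync, whatever.
public interface TagRule {
    String tag();                  // e.g. "important"
    void apply(FileRecord file);   // e.g. queue this file for backup
}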

Well, that's cool, but what about other storage sources? Those are a bit harder, and generally specific to that storage (email, for example). However, things like links to articles and downloads are pretty straightforward, and shouldn't be too hard to include.

Where am I now?

Heh. I mentioned that looking for duplicate files is harder than I thought it would be. I'm actually on my third try. The first was when I thought "I can do this with a script"; the second was with .NET, where I aimed bigger, but still not nearly big enough.

So, I've just completed the work on the file crawler, and the next bit is submitting the crawl results to the index. I've done this part before, and I don't expect it to be particularly hard, but I have to find the time for it. After that, something resembling a UI (I am, after all, trying to solve a problem), and then I'll put the whole thing out there with a big fat "alpha" disclaimer (probably under the Apache license, since I'm using so much of their stuff).

And that's what I'm doing, and where I'm at.

IT, Users, and Communication

I was going to let this article slide, and not get all meta-bloggy about it, but the rebuttal really tweaked me. It's all about what users get to install on their work machines, and IT's reaction.

They both miss the point, I think.

Mr. Manjoo related a story about Firefox, and the crowd cheered. If there was really that much demand for it, then it was a failure on the IT department's part not to know it was wanted - and, if they did know, not to at least acknowledge it clearly. There are plenty of good reasons not to upgrade.

What Mr. Manjoo missed is that there are tradeoffs to the freedom to install whatever you want, most of them related to support. A lot of IT policy is driven by how much they have to provide that support. Less money means coarser support - heavily locked down machines, aggressive re-imaging, or similar. Things that don't require a lot of people time.

The confirmation bias that both articles triggered in me, though, was that they clearly showed that in neither case are the IT department and the users communicating.

Good IT is hard, not just because of the technology involved, but because you have to make long-term decisions which will permit you to react to users' ever-changing needs and wants.

Remember, we're here for them, not the other way around. When I walk into a shop that doesn't live that attitude, I know I'll find a lot of problems.

Wednesday, August 12, 2009

Digital Sharecropping? Hah!

Jeff Atwood of CodingHorror seems to have a problem with user-generated content. He calls it "digital sharecropping". He includes a black and white photo of black people working a dusty field, just in case you didn't get the reference.

The gist of his analogy goes like this:
  • Users put their own work into building their particular segment of a much larger site.
  • The much larger site puts ads next to the work, and reaps profits.
  • The user receives nothing in return.
It's that last part that isn't true. The site provides a cheap and easy means of publishing on the internet - much easier than doing it all yourself. This particular generation differentiates itself from GeoCities et al. by providing additional tools for tracking related users and topics.

I think few of the people who publish on these sites are unaware that the host is trying to make money off of their work. At the beginning of his article, he repeats a story about a woman who contributes to a site. She calls it a "labor of love".

I think she knows exactly what she is doing. It's a hobby; it keeps her busy and satisfied. What is so difficult to understand about that?

I don't begrudge venues the opportunity to make a profit for providing a comfortable environment. I know of few people who do (you dirty smelly hippie commies!). To be honest, I'm glad that Mr. Atwood at least thinks about the topic, but really: it isn't all that big a deal.

Monday, August 10, 2009

Java's Lots of Little Files

I ran into this article about "Next Generation Java Programming Style" at YCombinator. There was some interesting discussion about the overall effectiveness of these suggestions.

Part of the discussion involved commenter Xixi asking if anyone had followed point #5, "Use many, many objects with many interfaces". It turns out I've been following that model. I started to reply there, but after a bit I recognized a blog post in the making.

Here's my general workflow, and I've found it to be quite effective.

The linked-to article refers to interfaces as "roles", and that's probably the easiest way to think of them.

If I have a few related items which need to be handled, I first create an interface for them: IMessage (I realize it isn't the Java convention to prefix interfaces with "I", but I prefer it - especially with as many as I ended up with). Add in a few obvious functions: send, recv.
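
A minimal sketch of what I mean (the String payload here is just a stand-in for the example, not anything from a real project):

// A "role": anything which can send and receive messages.
public interface IMessage {
    void send(String payload);
    String recv();
}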

Next, create the concrete class (the actual implementation): FooMessage. In this case, the messages deal with "Foo", so it has send, recv, and, say, count. Gotta know how many Foos we have, right?
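
Continuing the sketch, FooMessage might look something like this (the list-backed queue is purely illustrative):

import java.util.ArrayList;
import java.util.List;

// The concrete implementation: messages dealing with Foos.
// It fills the IMessage role, and adds count() on top.
public class FooMessage implements IMessage {
    private final List<String> foos = new ArrayList<String>();

    public void send(String payload) {
        foos.add(payload);
    }

    public String recv() {
        return foos.isEmpty() ? null : foos.remove(0);
    }

    // Gotta know how many Foos we have.
    public int count() {
        return foos.size();
    }
}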

Next up, the test class - but I'll get to it in a moment. This is where it falls in my overall workflow, but it doesn't make as much sense without talking about mocks first.

Last, I write the mock class for the concrete class. It also implements IMessage, but most of the implementation is empty - it just accepts the parameters, and maybe spits back a default value.
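
For the sketch above, the mock needs just enough state to let a test see what happened to it:

// Mock: fills the same IMessage role, but does almost nothing.
public class MockMessage implements IMessage {
    public int sendCount = 0;         // how many times send() was called
    public String lastPayload = null; // what was last handed to send()

    public void send(String payload) {
        sendCount++;
        lastPayload = payload;
    }

    public String recv() {
        return "default";             // canned default value
    }
}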

Which brings us back to the test class. Since I refer to everything via its interface, using those mocks is easy. In FooMessageTest, I use the concrete class FooMessage, and a whole bunch of mocks. Generally, everything but the class being tested can be a mock, so testing ends up nicely isolated and repeatable.
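
A sketch of the test, assuming JUnit 4. FooMessage happens to have no collaborators in this tiny example, so there's nothing to mock here - but in a real test, every collaborator of the class under test would be one of those mocks:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class FooMessageTest {
    @Test
    public void countTracksSends() {
        FooMessage msg = new FooMessage();
        msg.send("foo-1");
        msg.send("foo-2");
        assertEquals(2, msg.count());
    }

    @Test
    public void recvReturnsMessagesInOrder() {
        // Refer to the object by its role where possible.
        IMessage msg = new FooMessage();
        msg.send("first");
        msg.send("second");
        assertEquals("first", msg.recv());
    }
}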

In practice, concrete classes implement several interfaces (IFoo specifying count, say, alongside IBar, which specifies weight).

Okay, this was a lot of work up-front, and I'll be honest: I approached it with some concern that it wouldn't pay off.

Well, it has. Refactoring, with the assistance of NetBeans, has been a breeze. Adding new functionality, or modifying old, has been super-easy. Yes, there are a few hangups, but they tend to revolve more around my lack of planning than around the overall process. I don't feel as "free" as when I use Ruby, but I don't feel held up by the language or its environment.

The hardest part has been maintaining discipline. It is really easy to think that this particular class doesn't need an interface, or a mock, etc. - but that is no different than with any other methodology.

Tuesday, August 4, 2009

Svnserve, and Solaris 10

I had to go through the trouble of getting svnserve to run as an SMF-managed service on Solaris 10, so there's no reason you should have to, too.

Create the method script.


This script uses rc-like syntax; the XML manifest (coming up!) points at it.

vi /lib/svc/method/svc-svnserve

The contents:
#!/sbin/sh
#
# SMF method script for svnserve: "start" launches the daemon,
# "stop" kills any svnserve running as root.

case "$1" in
start)
    svnserve -r /var/svnroot -d
    ;;
stop)
    /usr/bin/pkill -x -u 0 svnserve
    ;;
*)
    # The braces and pipe must be quoted, or the shell will
    # try to interpret them.
    echo "Usage: $0 { start | stop }"
    exit 1
    ;;
esac

exit 0

Fix the permissions:

chmod 555 /lib/svc/method/svc-svnserve
chown root:bin /lib/svc/method/svc-svnserve

Test it with:

sh /lib/svc/method/svc-svnserve start

Try to connect, list, etc., and make sure it works the way you want it to.

Create the SMF manifest


vi /var/svc/manifest/site/svnserve.xml

The manifest itself:


<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='SUNWsvn:svnserve'>
  <service name='site/svnserve' type='service' version='1'>
    <single_instance/>

    <dependency name='loopback' grouping='require_all'
        restart_on='error' type='service'>
      <service_fmri value='svc:/network/loopback:default'/>
    </dependency>

    <exec_method type='method' name='start'
        exec='/lib/svc/method/svc-svnserve start'
        timeout_seconds='30'/>
    <exec_method type='method' name='stop'
        exec='/lib/svc/method/svc-svnserve stop'
        timeout_seconds='30'/>

    <property_group name='startd' type='framework'>
      <propval name='duration' type='astring' value='contract'/>
    </property_group>

    <instance name='default' enabled='true'/>

    <stability value='Unstable'/>

    <template>
      <common_name>
        <loctext xml:lang='C'>svnserve (Subversion server)</loctext>
      </common_name>
    </template>
  </service>
</service_bundle>

Check your work


Check the xml with:

xmllint --valid /var/svc/manifest/site/svnserve.xml

Then let's see if the SMF machinery likes it:

svccfg validate /var/svc/manifest/site/svnserve.xml

If everything looks good so far...

Importing the manifest


svccfg import /var/svc/manifest/site/svnserve.xml

It should show up under svcs, in the maintenance state. Let's fix that:

svcadm enable svnserve:default

If it doesn't start, check /var/svc/log/site-svnserve:default.log

You should be all nicely integrated now.

Monday, August 3, 2009

Fun With Projects!

Way back in the day, around April or so, I was talking about a project which was taking up my time. Yes, it is still coming along nicely. We're playing nice with ActiveMQ, Java, the whole bit.

It has taken a while to get as far as I have. It isn't that the underlying concept is all that difficult ("index files"); it's the scale at which I want to do it. So, there's been a lot of internal abstraction going on, with all of the attendant complexity (lots of little files).

What I'm really happy about is the overall process I've been following. I've been a proponent of tests and mocks, and I've used them a lot in my projects before. The one mistake I always made, that everyone always makes, is losing discipline - giving in to the urge to cut a corner. After all, I won't need a mock for that class; it's too simple, right?

I haven't done that this time around, and it is really paying off. I haven't been able to devote 100% of my time to this, so I've walked away more than once, and I have had no trouble picking up where I left off. New pieces work smoothly with older pieces, and I rarely have to second-guess the behavior of anything I've written so far.

There's more work to do, of course, but I see the light at the end of the tunnel. There are some obvious performance changes I can make, but once I've got the basic "duplicate files" functionality going, I'll post it all someplace.