Monday, August 31, 2009
Where's That File?
The author of this article claims "you have to think of content entirely abstractly". While there is some exposition as to what it should look like, it is very vague: "your system should be capable of managing any kind of content."
Fair enough, but how?
Well, that's what I've been working on. I think that the various types of data are best handled by programs specifically designed to handle that data. What we as users need is an easy way to find it.
The current solutions tend to involve centralization, synchronization, and search. You're supposed to keep all the important data centralized, if you need to organize it your own way then you synchronize it, and if you're looking for something you search for it.
Which is great, except that users don't do this, because it all sucks.
If I download a file from the internet, that file exists in two places which I can get to: my download folder, and the original link. If I copy it up to a CMS, now it is in three places. If that CMS is backed up, it exists in four. Copy it to a thumb drive? Now I'm up to five.
Five copies of the same file, in locations which are all equally valid, and all have their strengths and weaknesses. Between them, the data is unlikely to be completely irretrievable.
Now, as a user, all I want to know is "where's that file?" (thus the name of the project)
The author of the original article was correct in that the only important thing is the metadata. What he doesn't seem to get is that the metadata is the only content which needs to be managed.
Currently, the problem I'm solving is strictly a question of duplicate files on the network. I have files that I know must be backed up, but I don't know where all of those copies are. I don't want too many copies, because storage costs are on a rising curve: Each additional terabyte costs more than the previous terabyte.
Turns out, solving this problem isn't easy (my first naive implementations didn't scale), and a whole bunch of the work can be extended to other storage sources.
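To make that concrete, here's a minimal sketch of the duplicate-finding idea (all names here are mine, not the project's): bucket files by size first, so only files that share a size ever get read and hashed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

class DuplicateFinder {

    /** Walk a tree and group files that have identical content. */
    static Map<String, List<Path>> findDuplicates(Path root) throws IOException {
        // Pass 1: bucket by size. A file with a unique size cannot be a
        // duplicate, so we never have to read it at all.
        Map<Long, List<Path>> bySize = new HashMap<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isRegularFile).forEach(p -> {
                try {
                    bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
                } catch (IOException e) {
                    // unreadable file: skip it
                }
            });
        }

        // Pass 2: hash only the candidates that share a size.
        Map<String, List<Path>> byHash = new HashMap<>();
        for (List<Path> candidates : bySize.values()) {
            if (candidates.size() < 2) continue;
            for (Path p : candidates) {
                byHash.computeIfAbsent(sha256(p), k -> new ArrayList<>()).add(p);
            }
        }
        // Anything left in a group by itself isn't a duplicate.
        byHash.values().removeIf(group -> group.size() < 2);
        return byHash;
    }

    private static String sha256(Path p) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(Files.readAllBytes(p)); // stream in real code
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }
}
```

Reading whole files into memory is fine for a sketch; a real crawler would stream the hash, and probably compare a cheap prefix hash before committing to a full one.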
Having that, though, the next obvious step is to add personal metadata (tags, descriptions) to the files. You have to collect and index metadata anyway (file name, size, etc.), so why not add user metadata, too?
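As a sketch of the kind of record such an index might hold (the class and field names are hypothetical, not the project's), the system metadata and the user metadata sit comfortably side by side:

```java
import java.util.HashSet;
import java.util.Set;

/** One indexed copy of a file: system metadata plus the user's own tags. */
class FileEntry {
    final String location;    // where this copy lives: a path, a URL, a drive label
    final String name;        // e.g. "resume.doc"
    final long size;
    final String contentHash; // identifies "the same file" across copies
    final Set<String> tags = new HashSet<>(); // user metadata: "important", "resume", ...

    FileEntry(String location, String name, long size, String contentHash) {
        this.location = location;
        this.name = name;
        this.size = size;
        this.contentHash = contentHash;
    }
}
```

The content hash is what ties the five copies of one file together, regardless of where each copy lives.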
What I'd expect to see at that point is a UI which reflects the various metadata. If I'm looking for my resume, I shouldn't just find "resume.doc"; I should see every copy of it the system knows about, even the ones I can't get to right now. I'd prefer that the "nearest" copy be highlighted in some way, things like that.
What I'd like to do after that (as if I didn't want to do enough), is assign rules to various tags. If I label something with "important", then it should be included in a special backup/sync/whatever. Again, this isn't something that will be particularly difficult, but will require effort.
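Sketched out (again with hypothetical names), tag rules are little more than a map from tag to actions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Attach actions to tags, then fire the matching actions for each file. */
class TagRules {
    private final Map<String, List<Consumer<String>>> rules = new HashMap<>();

    /** Register an action to run for every file carrying this tag. */
    void on(String tag, Consumer<String> action) {
        rules.computeIfAbsent(tag, k -> new ArrayList<>()).add(action);
    }

    /** Run every registered action whose tag appears on this file. */
    void apply(String file, Set<String> tags) {
        for (String tag : tags)
            for (Consumer<String> action : rules.getOrDefault(tag, Collections.emptyList()))
                action.accept(file);
    }
}
```

Registering a backup action against "important" and running apply over the index would then queue every important file, wherever its copies live.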
Well, that's cool, but what about other storage sources? Those are a bit harder, and generally specific to that storage (email, for example). However, things like links to articles and downloads are pretty straightforward, and shouldn't be too hard to include.
Where am I now?
Heh. I mentioned that looking for duplicate files is harder than I thought it would be. I'm actually on my third try. The first was when I thought "I can do this with a script"; the second was with .NET, where I aimed bigger, but still not nearly big enough.
So, I've just completed the work on the file crawler, and the next bit is submitting the crawl results to the index. I've done this part before, and I don't expect it to be particularly hard, but I have to find the time for it. After that, something resembling a UI (I am trying to solve a problem), then put the whole thing out there with a big fat "alpha" disclaimer (probably Apache license, since I'm using so much of their stuff).
And that's what I'm doing, and where I'm at.
IT, Users, and Communication
They both miss the point, I think.
Mr. Manjoo related a story about Firefox, and the crowd cheered. If there was really that much demand for it, then it was a failure on the IT department's part not to know it was wanted, and if they did know, a failure to at least acknowledge it clearly. There are plenty of good reasons not to upgrade.
What Mr. Manjoo missed is that there are tradeoffs to the freedom to install whatever you want, most of them related to support. A lot of IT policy is driven by how much they have to provide that support. Less money means coarser support - heavily locked down machines, aggressive re-imaging, or similar. Things that don't require a lot of people time.
The confirmation bias that both articles triggered in me, though, was that in neither case are the IT department and the users communicating.
Good IT is hard, not just because of the technology involved, but because you have to make long-term decisions which will permit you to react to users' ever-changing needs and wants.
Remember, we're here for them, not the other way around. When I walk into a shop that doesn't live that attitude, I know I'll find a lot of problems.
Wednesday, August 12, 2009
Digital Sharecropping? Hah!
The gist of his analogy goes like this:
- Users put their own work into building their particular segment of a much larger site.
- The much larger site puts ads next to the work, and reaps profits.
- The user receives nothing in return.
I think few of the people who publish on these sites are unaware that the host is trying to make money off of their work. At the beginning of his article, he repeats a story about a woman who contributes to a site. She calls it a "labor of love".
I think she knows exactly what she is doing. It's a hobby, it keeps her busy, and satisfied. What is so difficult to understand about that?
I don't begrudge venues the opportunity to make a profit for providing a comfortable environment. I know of few people who do (you dirty smelly hippie commies!). To be honest, I'm glad that Mr. Atwood at least thinks about the topic, but really: it isn't all that big a deal.
Monday, August 10, 2009
Java's Lots of Little Files
Part of the discussion involved commenter Xixi asking if anyone had followed point #5, "Use many, many objects with many interfaces". It turns out, I've been following that model. I started to reply there, but after a bit I recognized a blog post in the making.
Here's my general workflow, and I've found it to be quite effective.
The linked-to article refers to interfaces as "roles", and that's probably the easiest way to think of them.
If I have a few related items which need to be handled, I first create an interface for it: IMessage. (I realize it isn't the Java convention to prefix interfaces with "I", but I prefer it, especially with as many as I ended up with.) Add in a few obvious functions: send, recv.
Next, create the concrete class (the actual implementation): FooMessage. In this case, the messages deal with "Foo". So, it has send, recv, and say count. Gotta know how many Foos we have, right?
Next up, the test class - but I'll get to it in a moment. This is where I write it in the overall workflow, but it doesn't make as much sense without talking about mocks.
Last, I write the mock class for the concrete class. It also implements IMessage, but most of the implementation is empty - it just accepts the parameters, and maybe spits back a default value.
Which brings us back to the test class. Since I refer to everything via its interface, using those mocks is easy. In FooMessageTest, I use the concrete class FooMessage, and a whole bunch of mocks. Generally, everything but the class being tested can use mocks, so testing ends up nicely isolated and repeatable. In practice, concrete classes implement several interfaces (IFoo, specifying count; you could also have IBar, which specifies weight).
Okay, this was a lot of work up-front, and I'll be honest: I approached it with some concern that it wouldn't pay off.
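For illustration, here's a minimal sketch of the whole pattern using the hypothetical names above (a hand-rolled mock, no mocking framework assumed):

```java
import java.util.ArrayList;
import java.util.List;

// The interface is the "role": callers only ever see this.
interface IMessage {
    void send(String payload);
    String recv();
}

// The concrete class carries the real behavior.
class FooMessage implements IMessage {
    private final List<String> queue = new ArrayList<>();
    public void send(String payload) { queue.add(payload); }
    public String recv() { return queue.isEmpty() ? null : queue.remove(0); }
    public int count() { return queue.size(); }  // gotta know how many Foos we have
}

// The mock implements the same role, but most of it is empty:
// it just records the call and spits back a default value.
class MockMessage implements IMessage {
    String lastSent;
    public void send(String payload) { lastSent = payload; }
    public String recv() { return "default"; }
}
```

In FooMessageTest, everything FooMessage talks to would be a MockMessage-style stand-in, so a failing test points at FooMessage itself rather than at a collaborator.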
Well, it has. Refactoring, with the assistance of NetBeans, has been a breeze. Adding new functionality, or modifying old, has been super-easy. Yes, there are a few hangups, but they tend to revolve more around my lack of planning than around the overall process. I don't feel as "free" as when I use Ruby, but I don't feel held up by the language or its environment.
The hardest part has been maintaining discipline. It is really easy to think that this particular class doesn't need an interface, or a mock, etc - but that is no different than any other methodology.
Tuesday, August 4, 2009
Svnserve, and Solaris 10
Create the method script.
This script uses rc-like syntax. The XML manifest (coming up!) will refer to it.
vi /lib/svc/method/svc-svnserve
The contents:
#!/sbin/sh
case "$1" in
start)
svnserve -r /var/svnroot -d ;;
stop)
/usr/bin/pkill -x -u 0 svnserve ;;
*)
echo "Usage is $0 { start | stop }"
exit 1 ;;
esac
exit 0
Fix the permissions:
chmod 555 /lib/svc/method/svc-svnserve
chown root:bin /lib/svc/method/svc-svnserve
Test it with:
sh /lib/svc/method/svc-svnserve start
Try to connect, list, etc., make sure it works the way you want it to.
Create the SMF manifest
vi /var/svc/manifest/site/svnserve.xml
The manifest, itself
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='SUNWsvn:svnserve'>
<service
name='site/svnserve'
type='service'
version='1'>
<single_instance/>
<dependency
name='loopback'
grouping='require_all'
restart_on='error'
type='service'>
<service_fmri value='svc:/network/loopback:default'/>
</dependency>
<exec_method
type='method'
name='start'
exec='/lib/svc/method/svc-svnserve start'
timeout_seconds='30' />
<exec_method
type='method'
name='stop'
exec='/lib/svc/method/svc-svnserve stop'
timeout_seconds='30' />
<property_group name='startd' type='framework'>
<propval name='duration' type='astring' value='contract'/>
</property_group>
<instance name='default' enabled='true' />
<stability value='Unstable' />
<template>
<common_name>
<loctext xml:lang='C'>
Subversion svnserve daemon
</loctext>
</common_name>
</template>
</service>
</service_bundle>
Check your work
Check the xml with:
xmllint --valid /var/svc/manifest/site/svnserve.xml
Then let's see if the smf stuff likes it:
svccfg validate /var/svc/manifest/site/svnserve.xml
If everything looks good so far...
Importing the manifest
svccfg import /var/svc/manifest/site/svnserve.xml
It should show up under svcs in maintenance. Let's fix that:
svcadm enable svnserve:default
If it doesn't start, check
/var/svc/log/site-svnserve:default.log
You should be all nicely integrated now.
Monday, August 3, 2009
Fun With Projects!
It has taken awhile to get as far as I have. It isn't that the underlying concept is all that difficult ("index files"), it is the scale at which I want to do it. So, there's been a lot of internal abstraction going on, with all of the attendant complexity (lots of little files).
What I'm really happy about is the overall process I've been following. I've been a proponent of tests and mocks, and I've used them a lot in my projects before. The one mistake I always made, that everyone always makes, is losing discipline - giving into the urge to cut a corner. After all, I won't need a mock for that class, it's too simple, right?
I haven't done that this time around. It is really paying off. I haven't been able to devote 100% of my time to this, so I've walked away more than once. I have had no trouble picking up where I was. New pieces work excellently with older pieces, and I barely question the predictability of anything I've done so far.
There's more work to do, of course. I see the light at the end of the tunnel, though. There's some obvious performance changes I can make, but once I've got the basic "duplicate files" functionality going, I'll post it all someplace.