Monday, August 31, 2009

Where's That File?

This post starts out with a reading of another blog, but it isn't outright babble. It's about what I'm working on.

The author of this article claims "you have to think of content entirely abstractly". While there is some exposition as to what it should look like, it is very vague: "your system should be capable of managing any kind of content."

Fair enough, but how?

Well, that's what I've been working on. I think that the various types of data are best handled by programs specifically designed to handle that data. What we as users need is an easy way to find it.

The current solutions tend to involve centralization, synchronization, and search. You're supposed to keep all the important data centralized; if you need to organize it your own way, you synchronize it; and if you're looking for something, you search for it.

Which is great, except that users don't do this, because it all sucks.

If I download a file from the internet, that file exists in two places I can get to: my download folder and the original link. If I copy it up to a CMS, now it is in three places. If that CMS is backed up, it exists in four places. Copy it to a thumb drive? Now I'm up to five.

Five copies of the same file, in locations which are all equally valid, and all have their strengths and weaknesses. Between them, the data is unlikely to be completely irretrievable.

Now, as a user, all I want to know is "where's that file?" (thus the name of the project)

The author of the original article was correct in that the only important thing is the metadata. What he doesn't seem to get is that the metadata is the only content which needs to be managed.

Currently, the problem I'm solving is strictly a question of duplicate files on the network. I have files that I know must be backed up, but I don't know where all of those copies are. I don't want too many copies, because storage costs are on a rising curve: Each additional terabyte costs more than the previous terabyte.

Turns out, solving this problem isn't easy (my first naive implementations didn't scale), and a whole bunch of the work can be extended to other storage sources.
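To make the duplicate problem concrete, here is a minimal sketch of one common approach, not the project's actual crawler: group files by size first, then hash only the files whose size collides. The function names (`file_digest`, `find_duplicates`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Return lists of paths (2 or more each) that share identical content."""
    # Pass 1: bucket every file by size. This needs no file reads.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable or vanished file; skip it

    # Pass 2: hash only files whose size matches at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[file_digest(path)].append(path)
            except OSError:
                continue
        groups.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return groups
```

The size pass is what keeps this from reading every byte on disk. It is still a single-machine sketch; it says nothing about crawling several hosts or keeping results in an index.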

Having that, though, the next obvious step is to attach personal metadata (tags, descriptions) to the files. You have to collect and index metadata anyway (file name, size, etc.), so why not add user metadata, too?
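As a rough illustration of keeping user metadata next to the collected metadata, here is a small in-memory sketch. `FileRecord`, `MetadataIndex`, and their fields are hypothetical names invented for this example; a real index would live in a database or search engine rather than a dict.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    # Collected metadata: what a crawler can see on its own.
    location: str
    name: str
    size: int
    digest: str
    # User metadata: what only a person can supply.
    tags: set = field(default_factory=set)
    description: str = ""


class MetadataIndex:
    """Toy index keyed by location (hypothetical, for illustration only)."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        self._records[record.location] = record

    def tag(self, location, *tags):
        self._records[location].tags.update(tags)

    def find_by_tag(self, tag):
        return sorted(r.location for r in self._records.values() if tag in r.tags)

    def copies_of(self, digest):
        """All known locations holding the same content."""
        return sorted(r.location for r in self._records.values() if r.digest == digest)
```

Note that tags here hang off a location, not off the content. Whether a tag should follow every copy of a file is a design question this sketch does not answer.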

What I'd expect to see at that point is a UI which reflects the various metadata. If I'm looking for my resume, I should not only find "resume.doc"; I should see every copy of "resume.doc" that has been indexed, even the ones I can't get to right now. I'd prefer that the "nearest" one be highlighted in some way, things like that.
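What "nearest" means is left open above, so the following is only one possible reading: reachable copies come first, then copies are ordered by an assumed ranking of location kinds. The `Copy` class, the `NEARNESS` table, and the kind names are all made up for this sketch.

```python
from dataclasses import dataclass

# Assumed ordering of location kinds, nearest first. Purely illustrative.
NEARNESS = {"local": 0, "network": 1, "removable": 2, "web": 3}


@dataclass(frozen=True)
class Copy:
    location: str
    kind: str
    reachable: bool


def rank_copies(copies):
    """Sort copies: reachable ones first, then by assumed nearness, then by name."""
    def key(copy):
        # Unknown kinds sort after every known kind.
        return (not copy.reachable, NEARNESS.get(copy.kind, len(NEARNESS)), copy.location)
    return sorted(copies, key=key)


def nearest(copies):
    """Return the copy a UI might highlight, or None if nothing is reachable."""
    ranked = rank_copies(copies)
    return ranked[0] if ranked and ranked[0].reachable else None
```

Unreachable copies stay in the ranked list on purpose, since knowing that a copy exists on an unplugged thumb drive is still useful.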

What I'd like to do after that (as if I didn't want to do enough), is assign rules to various tags. If I label something with "important", then it should be included in a special backup/sync/whatever. Again, this isn't something that will be particularly difficult, but will require effort.
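A tag-to-rule mapping could be as small as a lookup table. The sketch below assumes rules are just named actions attached to tags; the action names (`backup`, `sync`, `skip-backup`) and the `actions_for` helper are hypothetical.

```python
# Hypothetical rule table: each tag maps to the actions it should trigger.
DEFAULT_RULES = {
    "important": ("backup", "sync"),
    "scratch": ("skip-backup",),
}


def actions_for(tags, rules=DEFAULT_RULES):
    """Return the de-duplicated actions triggered by a file's tags."""
    actions = []
    for tag in sorted(tags):  # sorted so the result is deterministic
        for action in rules.get(tag, ()):
            if action not in actions:
                actions.append(action)
    return actions
```

Tags with no rule are simply ignored, so labeling a file never fails just because nothing is wired up to that label yet.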

Well, that's cool, but what about other storage sources? Those are a bit harder, and generally specific to that storage (email, for example). However, things like links to articles and downloads are pretty straightforward, and shouldn't be too hard to include.

Where am I now?

Heh. I mentioned that looking for duplicate files is harder than I thought it would be. I'm actually on my third try. The first one was when I thought "I can do this with a script"; the second was with .NET, where I aimed bigger, but found it was still not nearly big enough.

So, I've just completed the work on the file crawler, and the next bit is submitting the crawl results to the index. I've done this part before, and I don't expect it to be particularly hard, but I have to find the time for it. After that, something resembling a UI (I am trying to solve a problem), then put the whole thing out there with a big fat "alpha" disclaimer (probably Apache license, since I'm using so much of their stuff).

And that's what I'm doing, and where I'm at.
