
Re: MIME type meta-data

From: Matthew Hambley <matthew_at_aether.demon.co.uk>
Date: 2003-03-05 08:20:37 CET

In message <3E65157A.5070806@xbc.nu>
          Branko Čibej <brane@xbc.nu> wrote:

> B. W. Fitzpatrick wrote:
>
> >Matthew Hambley <matthew@aether.demon.co.uk> writes:
> >
[snip]
> > > If this is what people are after then I am willing to float the idea
> > > (and do the work) on the APR mailing list. If, on the other hand,
> > > people had something else in mind then I won't bother. Clearly it was
> > > not something considered useful to Apache otherwise it would be in
> > > there already.
> >
> > I've actually been interested in a facility like this for quite some
> > time. I'd love to hear you out on any designs that you come up
> > with--I'd love to see something like this in Subversion.
>
> May I be allowed to express interest, too? :-)
[snip]

Go on then. :-)

Warning, weighty musings follow:

Based on my own thinking and what has been said already in this thread I
can see a number of solutions to the problem of identifying the correct
MIME type for a file.

First off is the statistical method used currently. This has the advantage
that it is mechanical and does not rely on the user to know what they are
doing. Its failings are that it can tell the difference between text and
binary (based on an arbitrary cut-off level) but can't tell what *type* of
text or binary it's looking at. It is also expensive in that it requires a
scan of the file.
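
As a rough illustration of the statistical approach, here is a minimal
Python sketch. It is not Subversion's actual code: the function names, the
15% cut-off and the choice of which bytes count as "text" are all
assumptions made for the example.

```python
def looks_binary(data: bytes, cutoff: float = 0.15) -> bool:
    """Guess binary vs. text from the share of non-text bytes.

    `cutoff` is an arbitrary, hypothetical threshold.
    """
    if not data:
        return False  # treat an empty file as text
    if b"\x00" in data:
        return True  # NUL bytes are rare in plain text
    # Printable ASCII plus tab, LF, FF and CR count as "text" here.
    text_bytes = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0C, 0x0D}
    odd = sum(1 for b in data if b not in text_bytes)
    return odd / len(data) > cutoff


def guess_mime_statistical(data: bytes) -> str:
    """Map the text/binary verdict onto the two generic MIME types."""
    return "application/octet-stream" if looks_binary(data) else "text/plain"
```

Note that the result can only ever be one of two generic types, which is
exactly the limitation described above.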

So my first idea for an alternative was to use a system to map properties
of the file to a MIME type. As has been pointed out, this isn't necessarily
a file-extension-to-MIME map, as different operating systems have different
ways of solving the mapping problem. It is, however, a nicely contained
problem for which the system-specific portion can be wrapped up neatly.
The advantages of this solution are that it can differentiate between
different types of file, and that the information is commonly collected and
stored by operating systems anyway, so our code need be merely a wrapper
around that. The disadvantage is that it relies on the user to identify a
file with the correct tag before adding it to the repository. Given that
the users of Subversion will be mostly technical people, is this an
unreasonable assumption? We also have to bear in mind that not every type
will appear in the MIME map, thus we will need a backup method to cope in
this situation.
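
To make the map idea concrete, here is a small Python sketch that leans on
the standard library's mimetypes module as a stand-in for the
"system-specific portion". The wrapper name guess_mime_from_name is
hypothetical, and this only covers the file-extension flavour of the
mapping, not other platform schemes.

```python
import mimetypes
from typing import Optional


def guess_mime_from_name(filename: str) -> Optional[str]:
    """Return a MIME type from the extension map, or None if unknown."""
    # guess_type returns (type, encoding); only the type is wanted here.
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


if __name__ == "__main__":
    for name in ("index.html", "picture.png", "data.zz-no-such-ext"):
        print(name, "->", guess_mime_from_name(name))
```

A None result is the "not in the map" case, which is where a backup method
would have to take over.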

A third idea suggested is to use a "magic bytes" system which interrogates
the file for clues to its type. This can be considered the turbo nutter
bastard half brother of the statistical method. It has the same advantages
but solves the problem of not being able to distinguish between different
types beyond text and binary. It will probably be less computationally
expensive as well, since it is unlikely to have to scan the whole file,
merely a portion of it. One feature it shares with the map system is that
it will have to cope with the situation where a file cannot be identified.
It would also require considerable coding on some platforms since this
system is, to the best of my knowledge, peculiarly *nix.
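
A "magic bytes" check might look something like the following Python
sketch. The table is a tiny hand-picked sample of well-known file
signatures, not a real magic database, and the names MAGIC and
guess_mime_magic are made up for the example.

```python
from typing import Optional

# (leading bytes, MIME type) pairs; a small illustrative sample only.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x1f\x8b", "application/gzip"),
]


def guess_mime_magic(head: bytes) -> Optional[str]:
    """Identify a file from its first few bytes, or return None."""
    for signature, mime in MAGIC:
        if head.startswith(signature):
            return mime
    return None  # unidentified: the caller needs a fall-back


def guess_mime_magic_file(path: str, probe: int = 16) -> Optional[str]:
    """Read only the first `probe` bytes rather than the whole file."""
    with open(path, "rb") as handle:
        return guess_mime_magic(handle.read(probe))
```

Only the head of the file is read, which is where the saving over a full
statistical scan would come from.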

In conclusion:

Whatever more advanced system is chosen, chances are the current statistical
method will be retained to act as a fall-back for when the clever algorithm
doesn't come up with the goods.

Magic bytes and statistical scan could be combined quite neatly in a single
action, but I prefer the MIME map approach as it seems more portable.
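
The fall-back arrangement described in this conclusion could be wired up
roughly as below. This is only a sketch: detect_mime and the three toy
detectors are hypothetical stand-ins, each reduced to a single rule so the
example stays self-contained.

```python
from typing import Callable, Optional, Sequence

# A detector takes (filename, leading bytes) and may decline with None.
Detector = Callable[[str, bytes], Optional[str]]


def detect_mime(name: str, head: bytes,
                detectors: Sequence[Detector],
                fallback: Callable[[str, bytes], str]) -> str:
    """Try each clever detector in order; use the fall-back if all decline."""
    for detector in detectors:
        mime = detector(name, head)
        if mime is not None:
            return mime
    return fallback(name, head)


# Toy stand-ins for the three approaches (one rule each).
def by_map(name: str, head: bytes) -> Optional[str]:
    return "text/html" if name.endswith(".html") else None


def by_magic(name: str, head: bytes) -> Optional[str]:
    return "application/pdf" if head.startswith(b"%PDF-") else None


def statistical(name: str, head: bytes) -> str:
    return "application/octet-stream" if b"\x00" in head else "text/plain"


if __name__ == "__main__":
    chain = [by_map, by_magic]
    print(detect_mime("index.html", b"<html>", chain, statistical))
    print(detect_mime("report", b"%PDF-1.4", chain, statistical))
    print(detect_mime("notes", b"plain words", chain, statistical))
```

The order of the list decides which method wins when several could answer,
so preferring the map over magic bytes is just a matter of listing it first.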

And finally:

The question of where this checking should occur, client or server, was
raised. Although strong opinion was voiced against having it happen on
the client, I would like to argue for it. I would suggest that the client
is in a much better position to know what type a particular file is than
the server. Remember, this processing is only going to happen when a file
is added or imported, in which case the file was created on the client.
It therefore seems sensible that the client should know what it is.

-- 
(\/)atthew )-(ambley ---------------\ If something's worth doing it's worth
E-mail : matthew@aether.demon.co.uk  \ doing badly until you can learn to
Public key : C991137B                 \ do it well.
Web : http://www.aether.demon.co.uk/   \-----------------------------------
Received on Wed Mar 5 08:21:48 2003

This is an archived mail posted to the Subversion Dev mailing list.
