Re: Search subversion binary content

From: Daniel L. Rall <dlr_at_finemaltcoding.com>
Date: 2005-10-08 01:01:16 CEST

Ted, searching known binary formats is very feasible (although it requires
some work). Malcolm's suggestion of making the repository content available
to an external search tool which already knows how to deal with the format
that you're interested in is a good one. Doing so in conjunction with an
indexing repository crawler would provide an even cleaner integration than
maintaining a shadow repository (possibly expensive in disk space) of the
content from modified revisions and its associated metadata.
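
As a rough illustration of what such an indexing crawler might look like,
here is a minimal Python sketch. It assumes `svnlook` is on the PATH, and
the names `extract_tokens`, `changed_files`, and `index_revision` are
hypothetical, not part of any existing tool. The token extraction is a
crude strings(1)-style scan for ASCII words, standing in for a real
format-aware extractor; it would miss, for example, text stored as UTF-16.

```python
import re
import subprocess
from collections import defaultdict

# Runs of 4+ ASCII word characters, similar in spirit to strings(1).
# A crude stand-in for a real, format-aware text extractor.
TOKEN_RE = re.compile(rb"[A-Za-z0-9_]{4,}")


def extract_tokens(data):
    """Return the set of lowercased word-like tokens found in raw bytes."""
    return {m.group().lower().decode("ascii") for m in TOKEN_RE.finditer(data)}


def svnlook(*args):
    """Run svnlook and return its raw stdout (assumes svnlook is on PATH)."""
    result = subprocess.run(("svnlook",) + args, check=True,
                            stdout=subprocess.PIPE)
    return result.stdout


def changed_files(repos, rev, run=svnlook):
    """List file paths added or modified in rev, from `svnlook changed`."""
    output = run("changed", "-r", str(rev), repos).decode("utf-8")
    paths = []
    for line in output.splitlines():
        # The first four columns hold status flags; the path follows.
        status, path = line[:4].strip(), line[4:]
        # Skip blank lines, deletions, and directories (which end in '/').
        if not path or status.startswith("D") or path.endswith("/"):
            continue
        paths.append(path)
    return paths


def index_revision(index, repos, rev, run=svnlook):
    """Add every token of every file changed in rev to the index."""
    for path in changed_files(repos, rev, run):
        data = run("cat", "-r", str(rev), repos, path)
        for token in extract_tokens(data):
            index[token].add((rev, path))
    return index


if __name__ == "__main__":
    import sys
    # A post-commit hook receives the repository path and the revision.
    idx = index_revision(defaultdict(set), sys.argv[1], int(sys.argv[2]))
    print(len(idx), "distinct tokens indexed")
```

Called from a post-commit hook, which Subversion invokes with the
repository path and the new revision number as its first two arguments,
this would index only what each commit touched. A real deployment would
also need to persist the index somewhere rather than keep it in memory.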

In addition to the previously suggested tools, there are numerous
lower-level libraries which allow access to the various node types which
compose OLE documents. I have seen several Java and Perl tools used to
successfully implement indexing of this kind. Microsoft's failure to
publish specs for its binary formats does not preclude reverse engineering
them.
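
Before handing a file to one of those libraries, an indexer first has to
recognize what it is looking at. A minimal sketch of that dispatch step,
based on the well-known leading "magic" bytes of a few container formats,
might look like the following. The function name `sniff_format` and the
short format labels are hypothetical, and a real tool would recognize many
more formats than these three.

```python
# Leading "magic" bytes for a few common container formats.
SIGNATURES = [
    # OLE2 compound file, the container behind legacy .doc/.xls/.ppt files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(data):
    """Guess a container format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

The returned label could then select which extractor (an OLE reader, a PDF
text extractor, and so on) gets to produce the text that is indexed.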

- Dan

On Fri, 07 Oct 2005, Ted Shab wrote:

> Daniel,
>
> A good example is Spotlight in OS X. It will search
> on zip files, Word documents, PDF, etc. Obviously
> Google searches on at least PDF...
>
> Thanks!
>
>
>
> --Ted
>
> --- "Daniel L. Rall" <dlr@finemaltcoding.com> wrote:
>
> > On Fri, 07 Oct 2005, Ted Shab wrote:
> >
> > > Hello,
> > >
> > > Is there a best practice for searching Subversion binary
> > > content?
> >
> > By "binary content", I'm going to assume that you literally mean
> > searching for any binary string of data (as opposed to textual).
> >
> > Do many engines out there generate useful indices without tokenization
> > patterns? Given binary content, are there any tokens to generate an
> > index from, a la natural language words or characters, or patterns (in
> > images, music, etc.) which would work? If so, you might want to
> > generate an index of the repository (via post-commit hook, periodic
> > background processing, or both).
> >
> > If not, you could use a primitive solution like grep'ing a checkout of
> > a specific rev of the repository, or a more hands-on approach like a
> > crawler which searched on-demand, walking the repository based on a
> > specified tree and revision range, re-assembling each revision and
> > searching it.
> >
> > > What tools have people had experience using in this
> > > manner?
> >
> > I haven't heard of anyone doing binary searching (as described above).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 8 01:02:19 2005

This is an archived mail posted to the Subversion Dev mailing list.
