Re: format of svn:author

From: Mark Mielke <mark_at_mark.mielke.cc>
Date: Thu, 05 Jan 2012 02:30:16 -0500

On 01/04/2012 01:42 PM, Julian Foad wrote:
> A PROPOSAL FOR EXTENDED AUTHOR IDENTIFICATION
>
> USE CASES
>
> 1. [This one I am aware of.]
>
> A large company has authenticated user ids that are numeric. That means the "log" and "blame" information shown by most Subversion clients is not easy to understand. Therefore they use a (post-commit?) hook to change
> the svn:author property to a more friendly string, which (mostly) solves the display issue. However, it causes other problems. [What problems?]

Problems:

1) The unique identifier is no longer a direct match against external
identity management systems. For example, if svn:author is "Mark Mielke
(1234567)" and LDAP stores employeeNumber="1234567" and cn="Mark Mielke",
very few tools support the ability to pattern match svn:author to pull
out character groups and then look up the character group in an external
identity management system. I can't think of a single tool that provides
this capability out of the box. In these tools, if I am logged in as
"1234567", the tool cannot know which commits are mine, because
"1234567" is not equal to "Mark Mielke (1234567)".

2) Users may end up with multiple unique identifiers over time due to
the unique identifier portion being combined with a more approximate
(and therefore inaccurate) humanly readable form. Display name or email
may change over time, and the ability to uniquely identify the author
becomes more complex as the mapping must include every instance
discovered at commit time. Some of this is subject to which identifier
is selected as the unique identifier - but let us say that a system such
as Forge is used and the identifier is some sort of username such as
"twoleftfeet". The email might start as "joe_at_doe.com", but end up as
"jdoe_at_acme.com". Any report around commits, such as commits made per
user or commits for a particular user, would either end up with split
history (treating the history as belonging to two or more users) or the
reporting algorithm would need to allow for each instance to be
recognized as the same user. Similarly, names can change. Perhaps the
person gets married or divorced. "Mary Clairmont (prettygirl99)" becomes
"Mary Dupont (prettygirl99)".

For both of these problems, one could argue that the reporting tool
could take the complex value into account. It could parse out the unique
identifier. This presumes that you have access to the source code and
the ability to make the changes, which is often not the case (license
restrictions, resource requirements, ...). This could be true of one or
two tools, but certainly not of all tools that support Subversion, as
this is a fairly massive list. This is particularly problematic if there
is no standard, as it means that my work in my company against my
convention is not easily shareable with your work in your company
against your convention.
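
To make the workaround concrete, here is a rough sketch of what each
reporting tool would have to carry. The "Display Name (id)" pattern and
the function names are assumptions for illustration only; nothing in
Subversion defines this format, and another site could pick a different
one.

```python
import re
from collections import Counter

# Assumed site-specific convention: "Display Name (unique-id)".
# This is only an example format, not anything Subversion specifies.
AUTHOR_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<uid>[^()]+)\)\s*$")


def unique_id(svn_author):
    """Return the identifier part of a decorated svn:author value.

    Falls back to the whole string when the value is not decorated.
    """
    match = AUTHOR_RE.match(svn_author)
    return match.group("uid") if match else svn_author


def commits_per_user(authors):
    """Count commits per unique identifier rather than per raw string."""
    return Counter(unique_id(author) for author in authors)


log = [
    "Mary Clairmont (prettygirl99)",
    "Mary Dupont (prettygirl99)",
    "1234567",
    "Mark Mielke (1234567)",
]
naive = Counter(log)            # counts each raw string separately
merged = commits_per_user(log)  # counts by the extracted identifier
```

Counting the raw strings splits the history of each person in two, while
counting by the extracted identifier does not. The catch is that the
regular expression encodes one site's convention.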

> 2. [This one is a guess.]
>
> The leader of a small development team sharing a Subversion repository with other teams wants to set up a build slave that will send an email to the users who committed revisions leading to a build failure. The machine can see the Subversion user id but how can it get the user's email address? The team leader could ask the repository administrator to add a post-commit hook that adds an email address to a revision property after every commit, but that
>
> * requires involving the server admin;
> * won't get updated when the user changes their email address;
> * won't work for testing old revisions that were already committed before that time;
> * won't work if the build slave software needs to read a list of all user id->email mappings at once.

Much of the above can be accomplished today as it is server side and
server side gives more flexibility as it can be customized in one place.
To extend the above to a situation that makes it more difficult -

There are a number of tools such as Crucible/FishEye that will monitor a
Subversion repository for changes, and then take action based on the
commit log. So the actions are being performed by "clients" and not by
the server itself. If the "client" sees a Subversion commit for
"1234567" or "jdoe", how does it know who is the authority on what email
is associated with this account? With svn:author being the unique
identifier - this is not that difficult in many cases as it is a simple
LDAP query away. However, if we mix 1) and 2) together, we get the same
problem. Subversion users want to see the full name in "svn log" output,
so they update svn:author to include the full name like "Mark Mielke
(1234567)". Crucible/FishEye then sees the commit as authored by
"Mark Mielke (1234567)". How does it look up this value in LDAP to
find the email?
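
A minimal sketch of the failure, with a plain dictionary standing in for
the directory. The entries, the attribute names and the email address
are made up for the example; a real client would issue an LDAP query
instead.

```python
# Stand-in for a directory such as LDAP, keyed by the unique identifier.
# The entry below is invented for this sketch.
DIRECTORY = {
    "1234567": {"cn": "Mark Mielke", "mail": "mark@example.com"},
}


def email_for(svn_author, directory=DIRECTORY):
    """Look up the email for a commit author, or None if unknown."""
    entry = directory.get(svn_author)
    return entry["mail"] if entry else None


plain = email_for("1234567")                    # key matches the directory
decorated = email_for("Mark Mielke (1234567)")  # key no longer matches
```

With the plain identifier the lookup succeeds. With the decorated value
it returns None unless the client also knows how to undo the decoration.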

> 3. [This one is a guess.]
>
> An administrator wants to integrate Subversion with an issue tracker. Users have different user ids on the two tools. The admin wants to configure the tracker so that it automatically annotates an already committed Subversion revision with some status information. How can the tracker know with what user id to contact the Subversion server?

We don't have this requirement, but I believe this requirement can be
seen in situations such as:

1) Issue tracker, such as JIRA, is externally visible. Users and
customers can sign up to the external site directly. Identity management
system is stored in JIRA as these are essentially "external users".

2) Source management system, such as Subversion, is internal only. Users
and customers may be able to access the content read-only. Identity
management system is stored in Microsoft Active Directory or OpenLDAP
and are assigned according to corporate policies.

In this scenario, there are a lot of requirements to be able to map back
and forth between the internal and external ID. The binding might be
stored as an LDAP attribute such as "jirauser".

I don't know if this particular problem is for Subversion to solve or
not - but if the Subversion solution was general enough to support
configuration that might allow this information to be exposed in a
general way, somebody someday would probably be thankful. I wouldn't go
out of my way to specifically solve this requirement, though. Just, if
it comes for free with a good solution to the other requirements, don't
block it. :-)

> The rest of the proposal addresses UC1 and part of UC2 but not UC3. (UC3 looks like it needs some totally separate solution, outside of Subversion.)

Agree.

> REQUIREMENTS
>
> A Subversion client (of any kind so designed) shall be able to read extended information about the author of a revision. This information shall consist of a (possibly empty) set of fields. The set of possible extended author fields shall include at least:
>
> * authenticated user id
>
> * display name
> * email address
>
> It shall be possible to add other fields on the server side (by software upgrade and/or by configuration), and for a client (of any kind so designed) to discover and read these fields without any software upgrade on the client side.
> The svn:author property shall continue to exist. When not using the extended author fields, the svn:author property must continue to operate as before. When using the extended author fields, the design may restrict the use of the svn:author field. Example: the design could require that if extended author fields are to be usable then the svn:author field always holds the authenticated user id and must always be present and non-empty.

This is a smart compromise. Forwards and backwards compatibility.
Interface restrictions to guarantee extensibility.

In terms of some actual implementation of this, the documentation should
probably recommend that clients make use of the display name and email
address as standard fields, and only optionally be aware of
repository-specific additional attributes. Otherwise it gets pretty
messy in that you'd have to provide a means to make clients aware of
what is being published and how and where they should be displayed. I
would start with just the two and specific recommendations. For example,
annotated source code on a web page might show the display name, but
when one mouses over the display name or clicks on a gear icon to the
side, access to additional details might be displayed. The display name
might be linked such that a mouse click on the display name pulls up the
user profile, but the user profile would be identified by the unique
identifier. Enough information to recommend a consistent and useful
interface, but not enough to be restrictive.

You cover some of this below:

> A client shall access the extended author fields through the Subversion server, through the existing client-server protocols, possibly with protocol extensions. Any protocol extensions shall be backward compatible in that an old server with a new client or an old client with a new server shall (without user intervention) use the old 'svn:author' property.
>
>
> The fields that are available from a particular server or repository are determined by the administrator. For any particular committed revision, the server may provide any or all or none of the extended author fields. A client cannot rely on any particular field being available except to the extent that the administrator gives such an assurance. Example: if the client requests the authenticated user id and email address for a revision whose author has no email address recorded, the server shall provide the authenticated user id but no email address. If the server is temporarily unable to look up any information about a user, the server should respond with no extended author fields instead of waiting.
>
>
> The extended author fields are dynamic in the sense that the server need not always return the same values for the same committed revision. For example, a client might repeat exactly the same request for information about revision 1234 twice in quick succession, and the server might provide the email address as "a_at_b.c" the first time and "dd_at_ee.ff" the second time. Even the "authenticated user id" field could change.
>
>
> DESIGN
>
> The extended author fields are delivered through revision properties. The values are UTF-8 text. These revision properties are readable but not writable by clients.
>
> Three property names are initially designated as "well known":
>
> * prop name: "svn:author:authn-id"
> purpose: authenticated user id
> format: as used by Subversion's authentication (the default
> value of svn:author)
>
> * prop name: "svn:author:display-name"
> purpose: display name
> format: a single line (no line breaks), e.g. person's full
> name or shortened name or nickname
>
> * prop name: "svn:author:email"
> purpose: email address
> format: [TO BE SPECIFIED HERE]
>
>
> Other property names in this name space beginning with "svn:author:" can be designated as "well known" in the future, by an official announcement from the Subversion project.
>
> An administrator can configure other extended author fields to use property names that are not in the "svn:" name space. Example: an administrator could configure the property name "author:pgp-sig" to hold the author's PGP signature.

Excellent.

> SERVER DESIGN
> Any time the server is about to send a set of revision properties to
> the client, the server looks up the extended author fields and adds
> corresponding properties to the set of revision properties that it
> reports to the client. These property values override any values [...]
>
> The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key. The server may cache the results, provided that there is a way for the administrator to make the server use updated information.

The cache can be a typical cache. The information that might be returned
should generally be semi-persistent and not changing from minute to
minute. As long as updated information takes effect within a reasonable
time period (configurable along with the configuration on how to obtain
the extended attribute information in the first place?) there is no
problem.
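
By "typical cache" I mean something along these lines. This is only a
sketch: the class name, the `lookup` callable and the default expiry are
assumptions, not part of the proposal. It also follows the quoted rule
that a failed lookup yields no extended fields instead of blocking.

```python
import time


class AuthorFieldCache:
    """Cache extended author fields for a configurable number of seconds."""

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup    # callable: svn:author value -> dict of fields
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so that expiry can be tested
        self._entries = {}       # author -> (expires_at, fields)

    def get(self, author):
        now = self._clock()
        cached = self._entries.get(author)
        if cached is not None and cached[0] > now:
            return cached[1]
        try:
            fields = self._lookup(author)
        except Exception:
            # Backend unavailable: report no extended fields rather than
            # waiting, and do not cache the failure.
            return {}
        self._entries[author] = (now + self._ttl, fields)
        return fields

    def clear(self):
        """Let the administrator force fresh lookups."""
        self._entries.clear()
```

The expiry time gives the "reasonable time period", and `clear()` covers
the requirement that the administrator can make the server use updated
information.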

> If the client attempts to set any revision property in the "svn:author:" name space, the server shall report an error to the client. This applies even if the property value matches the value that was last read from the server or is currently known to the server, and even if the
> specific property name is not known to the server. If the client attempts to set any revision property that is not in the "svn:author:" name space but might be configured as an extended author field, the server records that revision property in the normal way. If a revision property (of any name) has a stored value and the extended author field look-up also provides a value for the same property name, the latter takes priority.
>
>
> The extended author fields [are | are not] available to the following hook scripts: pre-commit, ...

Although it is not necessary for the fields to be available to the hook
scripts, it would be extremely convenient for them to be so. We have
hooks that perform LDAP lookups, but each hook has to have intimate
knowledge of the environment it runs in, which makes the hooks difficult
to publish, for example as an open source component that others could
re-use. They may have hard-coded LDAP bind passwords, for example,
making them insecure to publish. It would be extremely nice if any open
source component writer could make use of these fields without having to
care where the values come from, and the configuration for where the
values come from could be centralized in one place: the Subversion server.
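
As I read the quoted server rules, they boil down to something like the
sketch below. All function names here are hypothetical and this is not
Subversion code; it only shows the write rejection, the priority rule,
and how little a reusable hook would need to know.

```python
AUTHOR_NAMESPACE = "svn:author:"


class ReadOnlyPropertyError(Exception):
    """Raised when a client tries to write an extended author field."""


def check_client_write(prop_name):
    """Reject any client write inside the reserved name space."""
    if prop_name.startswith(AUTHOR_NAMESPACE):
        raise ReadOnlyPropertyError(prop_name)


def revprops_for_client(stored, looked_up):
    """Merge stored revision properties with looked-up author fields.

    Looked-up values win when both provide the same property name.
    """
    merged = dict(stored)
    merged.update(looked_up)
    return merged


def notify_address(revprops):
    """What a reusable hook could do: read the field, ignore its origin."""
    return revprops.get("svn:author:email")
```

The hook helper at the end contains no LDAP host, bind password or
site-specific parsing, which is what would make it safe to publish.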

> CLIENT DESIGN
>
> Just an example. The "svn log" and "svn blame" commands could request the revision property named "svn:author:display-name", and if that is returned then use it instead of "svn:author", otherwise use the value of "svn:author". Further, a client-side configuration option could specify which property name should be used for these display purposes, so for example some users in a particular team could choose to have the "author:nickname" revision property displayed instead of "svn:author:display-name".

This would be great. I think many people like to see the format that Git
uses: Display Name <email_at_domain>. This should be an option.
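
A sketch of the fallback plus the Git-style option. The function and
parameter names are hypothetical and this is not an existing client API.
The property names are the ones from the proposal.

```python
def display_author(revprops, display_prop="svn:author:display-name",
                   git_style=False):
    """Pick the author string to show in log or blame style output."""
    # Prefer the configured display property, else fall back to svn:author.
    name = revprops.get(display_prop) or revprops.get("svn:author", "")
    email = revprops.get("svn:author:email")
    if git_style and email:
        return f"{name} <{email}>"
    return name


props = {
    "svn:author": "1234567",
    "svn:author:display-name": "Mark Mielke",
    "svn:author:email": "mark@example.com",
}
default_view = display_author(props)
git_view = display_author(props, git_style=True)
old_server_view = display_author({"svn:author": "1234567"}, git_style=True)
```

Against a server that returns no extended fields, the result is the plain
svn:author value, so output from old servers stays unchanged.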

> FURTHER SCOPE
>
> Does a client need to be able to look up the information in other ways, such as starting from svn:author rather than a revision number, or starting from an extended author field?
>

I'm not clear on how "svn blame" is implemented. Presuming that it knows
what commit each line belongs to and that these are already being
queried (i.e. the implementation won't have to significantly change as a
result of this proposal), it is satisfactory for it to access the
information from the revision properties. I don't at the moment see a
requirement to be able to query a list of known users, or information
for a particular user. Subversion is not a directory service. The main
capability being provided is to enable Subversion clients to be ignorant
about how the server has been configured to perform authentication and
identification of users, but still be able to provide extended
information about Subversion metadata back to the user. Staying within
domain is probably smart as it can be a clear boundary around the scope
that is being agreed to.

Final thoughts on this draft:

The reference implementation should come with perhaps two server modules
to support this capability. One should be a caching LDAP implementation
that is fully configurable. One should be based on operating system
services (PAM or getent() for Unix?). Other implementations should be
possible, but left outside of core.
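
For the operating system variant, the lookup could be as small as the
sketch below. It assumes a Unix system (Python's `pwd` module is not
available elsewhere) and the common convention that the first
comma-separated GECOS field holds the full name. The function name is
hypothetical. The account database has no email field, so none is
returned.

```python
import pwd  # Unix only


def os_author_fields(authn_id):
    """Look up extended author fields in the local account database."""
    try:
        entry = pwd.getpwnam(authn_id)
    except KeyError:
        # Unknown user: report no extended fields.
        return {}
    # Convention, not a guarantee: the first GECOS field is the full name.
    full_name = entry.pw_gecos.split(",")[0].strip()
    fields = {"svn:author:authn-id": entry.pw_name}
    if full_name:
        fields["svn:author:display-name"] = full_name
    return fields
```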

If the Subversion developers agree to some refinement of this proposal,
I understand that developer resources are limited and that there is no
guarantee that it would ever be implemented or if implemented that it
would ever be completed and distributed in core. I'm thinking that this
sort of project might be a good entry point for somebody such as myself
to contribute. Not sure about time right now - but if you put in the
effort to review and refine, then it would be only fair for me to at
least try to contribute.

Thanks for the time you put into this, Julian.

-- 
Mark Mielke <mark_at_mielke.cc>

Received on 2012-01-05 08:31:11 CET
