On Fri, Dec 25, 2020 at 11:17 AM Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > I'll figure out a way to have the mboxes downloadable. If I understand
> > Google's documentation of robots.txt, they don't care about robots.txt: if a
> > specific URL is linked from somewhere indexable, they will index it.
> > Maybe just make one big tarball of everything?
> One big tarball would be wasteful to consume (you would have to download
> everything) and to produce (it would need, basically, «cp everything.tgz
> tmp.tgz; tar -zcf - $new >> tmp.tgz; mv tmp.tgz everything.tgz», and you
> can see that's O(#everything) rather than O(appended stuff)). I would
> rather avoid that if possible.
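For concreteness, here is a sketch of that quoted pipeline (file names are illustrative, not the archive's actual layout). The append itself is cheap because gzip permits concatenated streams and GNU tar reads them with -i (--ignore-zeros); it is the cp step, done only to publish the updated file atomically, that costs O(#everything):

```shell
# Sketch of the append-with-copy pipeline quoted above; names are
# illustrative. Requires GNU tar for -i/--ignore-zeros.
set -e
workdir=$(mktemp -d)
cd "$workdir"
echo one > old.mbox
tar -zcf everything.tgz old.mbox           # existing big tarball
echo two > new.mbox
cp everything.tgz tmp.tgz                  # O(#everything) copy
tar -zcf - new.mbox >> tmp.tgz             # O(appended stuff) append
mv tmp.tgz everything.tgz                  # atomic publish
tar -iztf everything.tgz                   # lists old.mbox and new.mbox
```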
> Not sure what to do about robots. I suppose we could set <link
> rel="canonical"> in the HTTP headers when serving the rfc822 files (example
> in <https://en.wikipedia.org/wiki/Canonical_link_element#HTTP>)?
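Following the Wikipedia example linked above, a hypothetical httpd snippet for that could look like this (the FilesMatch pattern and canonical URL are assumptions, not the archive's actual configuration):

```apache
# Hypothetical: send a Link: rel="canonical" HTTP header when serving
# raw rfc822/mbox files, pointing crawlers at the HTML view instead.
<FilesMatch "\.mbox$">
    Header set Link "<https://example.apache.org/html-view/>; rel=\"canonical\""
</FilesMatch>
```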
I thought robots.txt can exclude subdirectories, so we could just cut off
(say) the directory holding the mboxes.
I'm not too worried about Google crawling the mboxes, as they'll likely do
it just once and never again (by keeping the etag and/or mtime).
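A robots.txt along these lines would exclude such a subdirectory (the path is illustrative):

```
# Hypothetical robots.txt; /mbox/ stands in for wherever the raw
# mboxes actually live.
User-agent: *
Disallow: /mbox/
```

Note, per the point above, that Google may still index an excluded URL (without crawling it) if it is linked from an indexable page.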
> > I couldn't figure out puppet, the link was a 404 for me. I've created a
> > request in Jira and I hope someone will take a look:
> > https://issues.apache.org/jira/browse/INFRA-21230
> I think the GitHub repository is restricted to Apache committers only, so
> you'll need to enter your GitHub username on id.apache.org in order to get
> access to that URL. If you don't have a GitHub account, there ought to be
> a mirror of the repository on *.apache.org somewhere (at least, if Infra's
> following the same policy PMCs do).
Correct: committers only. And only after linking accounts via
https://gitbox.apache.org/setup/ as Nathan noted (and we forgot to mention).
If you do not have a GitHub account, or do not want one (say, because you
don't want to accept their T&Cs), then you can use the repository via
gitbox.apache.org (ask on Slack for the link; I prefer not to post it here).
Received on 2020-12-25 23:54:26 CET