Re: [RFC] LKML Archive in Maildir Format
From: Jasper Spaans
Date: Tue Dec 18 2018 - 16:54:04 EST
Hi Joey,
On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > > I spent a lot of time trying to find an LKML archive in Maildir format
> > > that I could use for local searches with notmuch or something, but all
> > > the links I was able to find were all dead.
> >
> > You might instead use
> >
> > https://www.kernel.org/lore.html
> > https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
>
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, and after a few hours trying to convert it
> to Maildir I just gave up.
>
> I ended up just slowly scraping lkml.org for a couple weeks so I
> wouldn't disrupt anything and it worked fairly well. Just looking for
> advice on where to host this now so others might be able to use it.

Now you've caught my attention; first of all, there are more than 3M
messages stored in the lkml.org database, so I guess you've missed some
messages or something is really broken.

Besides, unless you figured out how to get to the raw data, you've just
scraped a rendering which discards stuff like PGP signatures and has
very incomplete headers. Unless you don't care about those, of course :)

Note that I've also been toying with the lore dataset, and wrote a tiny tool
to get Maildir-like data out of it; this code is a bit of a single-use jig,
so you'll need to do some coding if you really want to use it. Attached
anyway.

All the best and enjoy,
Jasper

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
gitpython = "*"
ipython = "*"

[dev-packages]

[requires]
python_version = "3.7"

from email.parser import BytesParser
from email.message import EmailMessage
from email.policy import default
from git import Repo

# Stop once this Message-ID is reached while walking back through history.
our_last_id = '<dc4d502c-bc3c-46e3-a984-41271951a5f7@xxxxxxxxxxxx>'
#'<20180711142744.GN3593@xxxxxxxxxxxxxxxxxx>'
repo = Repo('/Users/spaans/xsrc/lkml/lkml/git/6.git')
commit = repo.commit("master")
counter = 5000
froms = set()

# Walk from the newest commit towards the oldest, one message per commit.
while True:
    tree = commit.tree
    # Each commit's tree holds the raw message as a blob named 'm'.
    blob = tree['m']
    data = blob.data_stream.read()
    msg = BytesParser(policy=default).parsebytes(data)
    msgid = msg['Message-ID']
    from_ = msg['From']
    froms.add(from_)
    print(msgid)
    #import pdb; pdb.set_trace()
    if len(froms) > 1000:
        print("HAVE LOTS OF FRIENDS NOW")
        break
    if msgid == our_last_id:
        print("LADIES & GENTLEMEN, WE'VE GOT HIM")
        break
    parents = commit.parents
    # Expect a linear history; stop at the root commit or at a merge.
    if len(parents) != 1:
        print("WUH")
        break
    else:
        commit = commit.parents[0]
    #with open("output/%04d.eml" % counter, "bw") as f:
    #    f.write(data)
    counter -= 1

import pprint
pprint.pprint(froms)
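
The script above stops short of the Maildir step: the write to output/NNNN.eml is commented out. As a minimal sketch of that missing half, the raw bytes read from each 'm' blob could be handed to the standard-library mailbox module. The helper name store_in_maildir and the skipping of repeated Message-IDs are assumptions made for illustration; neither is part of the attached tool.

```python
import mailbox
from email import policy
from email.parser import BytesParser


def store_in_maildir(raw_messages, maildir_path):
    """Add raw messages (bytes) to a Maildir and return how many were stored.

    Hypothetical helper: messages whose Message-ID was already seen in this
    run are skipped, messages without a Message-ID are always stored.
    """
    box = mailbox.Maildir(maildir_path, create=True)
    seen = set()
    stored = 0
    try:
        for raw in raw_messages:
            # Only the headers are needed to read the Message-ID.
            headers = BytesParser(policy=policy.default).parsebytes(
                raw, headersonly=True)
            msgid = headers["Message-ID"]
            if msgid is not None:
                key = str(msgid).strip()
                if key in seen:
                    continue
                seen.add(key)
            # Maildir.add() accepts a byte string and files it under new/.
            box.add(raw)
            stored += 1
    finally:
        box.close()
    return stored
```

In the loop above this would replace the commented-out open()/write() pair, for instance by collecting each data value into a list and passing that list to store_in_maildir together with a target directory. Something like notmuch should then be able to index the result, though that part is untested here.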
Attachment:
signature.asc
Description: PGP signature