Lessons as Packages #25

gvwilson · 2015-03-16T01:56:48Z

The aim of this project is to build tools to turn lessons into packages that can be installed via standard package management applications. A proof of concept exists at https://github.com/swcarpentry/installable-lesson-demo-01, and more information is given in the ideas-list file.

donalus · 2015-03-22T18:07:49Z

This is an interesting idea. Am I correct in interpreting this as being as much an exploratory project as an attempt to create a working prototype? On the surface it seems trivial, but there seem to be a lot of interesting challenges lurking in the shadows.

Are there potential challenges with licensing of package contents with this approach? I saw in the yaml for the conda packages it lists the license, but code, text, and data may potentially be licensed differently.

rgaiacs · 2015-03-22T18:24:28Z

> Am I correct as interpreting this as much of an exploratory project as it is an attempt to create a working prototype?

Yes, this is a bit exploratory.

wking · 2015-03-23T03:39:17Z

On Sun, Mar 22, 2015 at 11:07:51AM -0700, Donal Heidenblad wrote:

> Am I correct as interpreting this as much of an exploratory project
> as it is an attempt to create a working prototype?

I think we should clarify the scope a bit, because Greg's demo [1] and
the 2014 sprint project [2] both look like working prototypes to me.
What is this project looking for beyond that? Something that builds
LaTeX papers locally? Does it need to install any missing LaTeX
dependencies? Does it need to install LaTeX itself automatically?
Does it need to install Web2C and Automake [3]? Kpathsea? A C
compiler for Kpathsea and Web2C? Bootstrapping these things in a
generic way is hard (for example, a lot of time has been invested in
Hashdist, and they still can't bootstrap themselves [4,5]). But if
you can set the dependency bar after TeX Live, you're basically
already there with Greg's demo. So what's a reasonable cutoff for
this project? Is LaTeX enough?

The plan [6] rules out existing package managers (that do all of this
stuff already), with a claim that writing packages takes too much
effort. Do we have a threshold for what a sufficiently simpler
packaging system would look like? Packaging LaTeX docs with Portage
[7] is already pretty easy, and booting a Portage-native OS from a USB
stick isn't hard [8]. Alternatively, you can use Docker and my
wking/gentoo-portage image (tested with 20141204) or a VM to run a
Portage-native OS, or install Portage as a guest package manager [9].
Then you can almost go from a base toolchain to a compiled paper with
an ebuild as simple as:

EAPI="5"
SRC_URI="http://arxiv.org/e-print/1210.0530v4 -> ${P}.tar.gz"
DESCRIPTION="Best Practices for Scientific Computing preprint"
HOMEPAGE="http://arxiv.org/abs/1210.0530"
LICENSE="CC-BY-3.0" # I think
SLOT="0"
KEYWORDS="amd64 x86"
DEPEND="dev-texlive/texlive-latexextra"
# texlive-latexextra has authblk under preprint/ [1].
# [1]: https://www.ctan.org/pkg/authblk is in TeX Live

# Sources unpack directly into ${WORKDIR} rather than a ${P} subdirectory.
S="${WORKDIR}"

src_compile() {
    # Run pdflatex twice so that cross-references resolve.
    pdflatex best-practices-scientific-computing-2012.tex &&
    pdflatex best-practices-scientific-computing-2012.tex ||
    die "Failed to run pdflatex"
}

src_install() {
    # Install the compiled PDF as package documentation.
    dodoc best-practices-scientific-computing-2012.pdf
}

I had to download the paper's source by hand because arXiv blocks the
vanilla wget [10], but with the paper source hosted in a Git
repository that wouldn't be an issue. Anyhow, that doesn't seem to be
too complicated to me. I'd guess a negligible fraction of folks know
the ebuild syntax out of the gate, but that's probably true of any
existing package manager, and this is short enough that folks should
be able to easily match and replace in templates. I'd expect you
could write similarly concise packages for this paper using pacman,
Homebrew, or any other mature, source-based packaging system.
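
To illustrate the "match and replace in templates" point, here is a minimal sketch of a shell function that fills in an ebuild template for a LaTeX paper. The function name `render_paper_ebuild` and its four arguments are hypothetical, the license line is a placeholder, and the dependency list is simply copied from the ebuild above; a real paper may well need different values.

```shell
#!/bin/sh
# Hypothetical helper (not part of Portage): print an ebuild for a LaTeX
# paper by filling in a fixed template.
#   $1 description   $2 homepage URL   $3 source URL
#   $4 main .tex file name without the extension
render_paper_ebuild() {
    description=$1
    homepage=$2
    src_uri=$3
    main_tex=$4
    # The heredoc is unquoted so our own variables expand; variables that
    # Portage must expand later (P, WORKDIR) are escaped with a backslash.
    cat <<EOF
EAPI="5"
SRC_URI="${src_uri} -> \${P}.tar.gz"
DESCRIPTION="${description}"
HOMEPAGE="${homepage}"
LICENSE="CC-BY-3.0"  # placeholder: set this per paper
SLOT="0"
KEYWORDS="amd64 x86"
DEPEND="dev-texlive/texlive-latexextra"

S="\${WORKDIR}"

src_compile() {
    pdflatex ${main_tex}.tex &&
    pdflatex ${main_tex}.tex ||
    die "Failed to run pdflatex"
}

src_install() {
    dodoc ${main_tex}.pdf
}
EOF
}
```

Calling `render_paper_ebuild "Some preprint" "http://example.org/abs" "http://example.org/src" main` and redirecting the output to a file would give a starting point that still needs review by hand, in particular the license and dependencies.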

gvwilson · 2015-03-23T18:50:46Z

On 2015-03-22 2:07 PM, Donal Heidenblad wrote:

> This is an interesting idea. Am I correct as interpreting this as much
> of an exploratory project as it is an attempt to create a working
> prototype? On the surface it seems trivial, but there seems to be a
> lot of interesting challenges lurking in the shadows.

There's certainly a lot of exploration involved, but there has to be a
working system at the end of the day. The prototype at
git@github.com:swcarpentry/installable-lesson-demo-01, for example, has
to make decisions about where to store lessons when they've been
installed, where to put things like exercise files for learners to work
on, how to re-set when the learner wants to start an exercise over, how
to report dependencies between lessons (my tutorial on how a call stack
works could depend on anyone's introduction to writing functions -
that's different from how most package managers track dependencies), etc.
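
One of the decisions listed above, how to reset an exercise, could look like the following sketch. It assumes a layout the thread does not specify: pristine exercise files are kept in the installed lesson, and the learner edits a separate working copy. The function name `reset_exercise` and both directory arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical layout: pristine exercise files live with the installed
# lesson, and the learner works in a separate copy that can be thrown away.
#   $1 directory holding the pristine exercise files (assumed)
#   $2 learner's working copy (assumed)
reset_exercise() {
    pristine_dir=$1
    work_dir=$2
    if [ ! -d "$pristine_dir" ]; then
        echo "no such lesson files: $pristine_dir" >&2
        return 1
    fi
    if [ -z "$work_dir" ]; then
        echo "no working directory given" >&2
        return 1
    fi
    # Discard the learner's changes, then copy the pristine files back,
    # including dotfiles.
    rm -rf "$work_dir"
    mkdir -p "$work_dir"
    cp -R "$pristine_dir"/. "$work_dir"/
}
```

Keeping the pristine files separate from the working copy means a reset never needs to reinstall the lesson package, though whether that is the right trade-off is exactly the kind of question the prototype has to answer.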

> Are there potential challenges with licensing of package contents with
> this approach? I saw in the yaml for the conda packages it lists the
> license, but code, text, and data may potentially be licensed differently.

Yup - a working system will have to handle that without making the
simple case complicated.

wking · 2015-03-23T19:42:23Z

Hmm, is this about packaging lessons or packaging papers? The title
for this issue [1] and the idea sounds like it's focused on lessons,
but the text for the idea sounds like it's talking about papers [2]
(which is why I used a paper for my ebuild example).

gvwilson · 2015-03-23T19:52:18Z

I think (well, I hope) that a system capable of handling one will be able to handle the other. The use cases are "I want to review your paper" and "I want to combine your paper with two others to do new research" - the first is more-or-less standalone, unless your paper depends on your previous work, but the second is closer to the lesson installation example.

wking · 2015-03-23T20:22:25Z

On Mon, Mar 23, 2015 at 12:52:20PM -0700, Greg Wilson wrote:

> I think (well, I hope) that a system capable of handling one will be
> able to handle the other. The use cases are "I want to review your
> paper" and "I want to combine your paper with two others to do new
> research" - the first is more-or-less standalone, unless your paper
> depends on your previous work, but the second is closer to the
> lesson installation example.

For “I want to rebuild your paper” or “rerun your simulation” (or
basically “re… ” anything), I think existing package managers already
have you covered (for example, see my “rebuild your paper” ebuild
above). If the goal of this project is just to polish that UX up for
folks new to package management, that's fine (but I think we want to
clarify what sort of polishing we think needs to happen).

For “I want to combine your paper with two others”, that's going to be
reasonably easy (but hard to completely automate) if both papers use
the same tools (e.g. LaTeX with similar packages, or Markdown with
similar extensions, or …). But combining a paper written in markdown
with another written in LaTeX is going to be really hard to
automate. And what does “combining papers” mean? I think web-enabled
research is going to become mature when folks maintain their papers
like developers maintain libraries. E.g. there's a team managing the
“protein force spectroscopy” library, and folks doing new research
publish by sending PRs to that team, so there's one location (or a few
parallel locations in the event of a fork/competitor) that always
documents the state of the art for that particular niche. But writing
a new paper that cleanly synthesizes two existing papers is probably
not something that a machine should be doing (and certainly not
something that a package manager should be doing).

Recording prerequisites as dependencies (what the user should know
before reading this) seems completely separate from build dependencies
(what the computer should know before running this). Are you looking
for a package manager that automatically creates a curriculum for a
user by tracking their mental state like a package tree? “You want to
learn about call stacks, and I remember that you already completed a
lesson on variables, so I suggest you read lesson-123 on functions,
and then lesson-456 on call stacks”.
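
The curriculum-style suggestion described here could be sketched as a walk over a prerequisite graph. This is only an illustration: the lesson names, the `PREREQS` table, and the `suggest_path` function are all hypothetical, and ordering is delegated to the standard `tsort` utility.

```shell
#!/bin/sh
# Hypothetical prerequisite table, one pair per line: "<prerequisite> <lesson>".
PREREQS='variables functions
functions call-stacks
variables loops'

# Print the lessons still needed to reach the target, in study order.
#   $1 target lesson; remaining arguments: lessons already completed.
suggest_path() {
    target=$1
    shift
    completed=" $* "
    needed=" $target "
    frontier=$target
    # Walk backwards from the target, collecting prerequisites that the
    # learner has not completed. Completed lessons are not expanded further.
    while [ -n "$frontier" ]; do
        next=
        for lesson in $frontier; do
            for pre in $(printf '%s\n' "$PREREQS" |
                         awk -v l="$lesson" '$2 == l { print $1 }'); do
                case "$completed$needed" in
                    *" $pre "*) continue ;;
                esac
                needed="$needed$pre "
                next="$next $pre"
            done
        done
        frontier=$next
    done
    # Feed tsort the needed lessons (as self-pairs, so each one appears)
    # plus the prerequisite pairs that connect two needed lessons.
    {
        for lesson in $needed; do
            echo "$lesson $lesson"
        done
        printf '%s\n' "$PREREQS" | while read -r pre lesson; do
            case "$needed" in
                *" $pre "*)
                    case "$needed" in
                        *" $lesson "*) echo "$pre $lesson" ;;
                    esac ;;
            esac
        done
    } | tsort
}
```

With this table, `suggest_path call-stacks variables` prints `functions` and then `call-stacks`, which mirrors the quoted suggestion. Note that nothing here touches build dependencies; it is a separate graph with separate semantics.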

Or are we comfortable listing prerequisites in the text and letting
the student be their own prerequisite package manager? “Ah, this
call-stack lesson says I should understand about functions. I'm not
very familiar with them, so I'll click on this link to one of their
suggested function lessons…”

You also point out issues with lesson interactivity (“how to re-set
when the learner wants to start an exercise over” [1]), but that
sounds like something the lesson can work out internally, and not
something that a package manager needs to get involved in. Although I
expect most lessons will want to use a traditional package manager so
they can share that interactivity logic via an external library/tool.

donalus · 2015-03-26T19:55:36Z

I am concerned with the suggestion that ebuild or any of the software make systems are "easy enough". Adoption is a huge issue and I think the plan nails it with: "creating a package manifest for each paper would require more effort than most scientists would be willing to put in". I think that is the primary hurdle, but maybe by focusing on lessons we could skirt the issue and focus on the technical challenges first. Lesson creators can be expected to suffer more hurdles than a scientist without a funding agency mandate. I guess the hope would be that the people packaging lessons would also try to package their papers, which could then identify friction points that would need to be addressed. So in the short term, we'd accept that most scientists would think we are crazy, but with the goal of making it easier over time.

Once we start talking about managing lesson pre-requisites it starts to sound like we are talking about enabling lessons to be used outside of the workshops in more of an online learning context. That might be desirable, but there are mature tools available that could be used rather than extending a package manager. Packaging lessons complements both workshops and online learning, though, so it isn't exclusionary, just a question of identifying the right tools and defining the right scope.

wking · 2015-03-26T20:11:25Z

On Thu, Mar 26, 2015 at 12:55:38PM -0700, Donal Heidenblad wrote:

> I am concerned with the suggestion that ebuild or any of the
> software make systems are "easy enough".

I agree, but I'm trying to put my finger on why these approaches have
not been adopted already. My current guesses are:

a. There's no incentive to package your paper/lesson for reuse. Even
if it were really easy, folks are busy, and this is something you
can skip.
b. Folks are less likely to be running systems like Gentoo/Arch/…
that ship native source-based package managers (which have, I think,
the lowest bar for creating a new package). The learning cost of
switching to such a system is too high for the small gain you'd get
from easily packaging your papers/lessons.

You can work around (b) by using a guest package manager (HashDist,
pip, Homebrew, npm, …), but it's hard to integrate a guest package
manager with your system libraries, and you'll end up maintaining a
quasi VM just for packaging papers/lessons. Again, folks see this as
more trouble than it's worth.

> Adoption is a huge issue and I think the plan nails it with:
> "creating a package manifest for each paper would require more
> effort than most scientists would be willing to put in".

I don't think that this is the problem. Creating the manifest is easy
(as I showed in my ebuild above), but installing the system that
consumes the manifest (in this case, Portage) is hard. I don't think
the idea (as it's currently written) is aiming at making it easier to
get folks onto Gentoo/Arch/… with native source-based package
managers, nor is it aiming at making the interface between host and
guest package managers more robust.

> So in the short term, we'd accept that most scientists would think
> we are crazy, but with the goal of making it easier over time.

Yeah, this sounds right to me ;).

> Once we start talking about managing lesson pre-requisites it starts
> to sound like we are talking about enabling lessons to be used
> outside of the workshops in more of an online learning context. That
> might be desirable, but there are mature tools available that could
> be used rather than extending a package manager.

Such as… ?

rgaiacs · 2015-03-26T20:38:11Z

@donalus I don't know why @gvwilson forgot to include links to @khinsen's blog

The biggest problem with this idea is that we are being too greedy. What I can suggest given the time that you have to write a proposal is:

a. Use conda as a starting point because it can handle both Python and R (I haven't tested this).
b. Design a format to record the Python and R dependencies.
c. Implement a "package manager for papers" based on conda that reads the dependency file, downloads the paper and the software it uses (e.g. numpy, matplotlib, ...), and compiles the paper. If the paper uses LaTeX we will leave it as .tex, and if the paper uses Pandoc we will leave it as .md.
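
The steps above could be sketched roughly as follows. The dependency-file format (one conda package name per line, `#` for comments) and the function names `conda_command_for` and `build_command_for` are assumptions made for illustration, not an agreed design. The functions only print the commands they would run, so nothing is installed or built, and downloading the paper is left out.

```shell
#!/bin/sh
# Print the conda command that would create an environment for a paper.
#   $1 environment name
#   $2 dependency file: one package per line, '#' starts a comment (assumed format)
conda_command_for() {
    env_name=$1
    deps_file=$2
    # Drop comments and blank lines, then join the package names with spaces.
    packages=$(awk '{ sub(/#.*/, "") } NF { printf "%s%s", sep, $1; sep = " " }' "$deps_file")
    echo "conda create --yes --name $env_name $packages"
}

# Print the command that would compile the paper, chosen by file extension.
# The source file itself is left in its original format.
build_command_for() {
    case $1 in
        *.tex) echo "pdflatex $1" ;;
        *.md)  echo "pandoc $1 -o ${1%.md}.pdf" ;;
        *)     echo "unknown paper format: $1" >&2
               return 1 ;;
    esac
}
```

Printing the commands instead of executing them keeps the sketch easy to inspect; a real tool would also have to decide how R packages are named in the file and which conda channels to use, which this sketch does not address.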

donalus · 2015-03-26T21:35:50Z

Thank you, @r-gaia-cs, that is really helpful info. I agree that we are being too greedy. My product manager brain didn't want to sign up for anything that I couldn't accomplish within the timeframe of GSOC. This scope seems reasonable and a good platform to build toward the rich feature set that I think we all want.

I just talked to Carole Goble in the hallway after she gave a great plenary talk on reproducibility in research at the conference I am attending. Her slides heavily featured the SC/DC logos. It made me start thinking about this more heavily and want to tackle this challenge, so I will do my best to get something in for tomorrow's deadline.

rgaiacs · 2015-03-26T21:39:59Z

Great.

khinsen · 2015-03-27T07:31:19Z

@r-gaia-cs Thanks for adding the references. There's also my recent article on ActivePapers (http://dx.doi.org/10.12688/f1000research.5773.2), which discusses the relation between software deployment, software preservation, and reproducible research.

Package managers are primarily about software deployment, and the aim of this project is to extend them to reproducible research, which makes some excursion into the territory of software preservation inevitable. I see two really difficult aspects in this: (1) software preservation in such a system can be no better than what the underlying platform guarantees. Python + conda is neither great nor very bad in this respect, but as soon as you integrate C code, problems will certainly appear. (2) the user interface, or, put differently, working towards acceptance by scientists. Package managers don't have a great reputation for usability. Again conda is one of the nicer ones, so it's worth looking into.

rgaiacs · 2015-03-27T11:53:21Z

@khinsen Thanks for the link to your recent article, and for making it available under CC0, ✨. Maybe with it I will understand ActivePapers once and for all. =)

@donalus There are less than 7 hours left, so if you want to work on this:

a. Submit a very small draft of your proposal at https://www.google-melange.com/gsoc/homepage/google/gsoc2015 as soon as possible.
b. If you need to chat, you can reach me on IRC (#swcarpentry on Freenode).

rgaiacs · 2015-03-28T01:34:10Z

I'm closing this issue since the student application period is over.

rgaiacs added the student-needed label Mar 25, 2015

rgaiacs added the 2015 label Mar 28, 2015

rgaiacs closed this as completed Mar 28, 2015
