Lessons as Packages #25

Closed
gvwilson opened this issue Mar 16, 2015 · 15 comments
Comments

@gvwilson
Contributor

The aim of this project is to build tools to turn lessons into packages that can be installed via standard package management applications. A proof of concept exists at https://github.com/swcarpentry/installable-lesson-demo-01, and more information is given in the ideas-list file.

@donalus

donalus commented Mar 22, 2015

This is an interesting idea. Am I correct in interpreting this as much of an exploratory project as it is an attempt to create a working prototype? On the surface it seems trivial, but there seem to be a lot of interesting challenges lurking in the shadows.

Are there potential challenges with licensing of package contents with this approach? I saw in the yaml for the conda packages it lists the license, but code, text, and data may potentially be licensed differently.

@rgaiacs
Contributor

rgaiacs commented Mar 22, 2015

Am I correct in interpreting this as much of an exploratory project as it is an attempt to create a working prototype?

Yes, this is a bit exploratory.

@wking
Member

wking commented Mar 23, 2015

On Sun, Mar 22, 2015 at 11:07:51AM -0700, Donal Heidenblad wrote:

Am I correct in interpreting this as much of an exploratory project
as it is an attempt to create a working prototype?

I think we should clarify the scope a bit, because Greg's demo [1] and
the 2014 sprint project [2] both look like working prototypes to me.
What is this project looking for beyond that? Something that builds
LaTeX papers locally? Does it need to install any missing LaTeX
dependencies? Does it need to install LaTeX itself automatically?
Does it need to install Web2C and Automake [3]? Kpathsea? A C
compiler for Kpathsea and Web2C? Bootstrapping these things in a
generic way is hard (for example, a lot of time has been invested in
Hashdist, and they still can't bootstrap themselves [4,5]). But if
you can set the dependency bar after TeX Live, you're basically
already there with Greg's demo. So what's a reasonable cutoff for
this project? Is LaTeX enough?

The plan [6] rules out existing package managers (that do all of this
stuff already), with a claim that writing packages takes too much
effort. Do we have a threshold for what a sufficiently simpler
packaging system would look like? Packaging LaTeX docs with Portage
[7] is already pretty easy, and booting a Portage-native OS from a USB
stick isn't hard [8]. Alternatively, you can use Docker and my
wking/gentoo-portage image (tested with 20141204) or a VM to run a
Portage-native OS, or install Portage as a guest package manager [9].
Then you can almost go from a base toolchain to a compiled paper with
an ebuild as simple as:

EAPI="5"
SRC_URI="http://arxiv.org/e-print/1210.0530v4 -> ${P}.tar.gz"
DESCRIPTION="Best Practices for Scientific Computing preprint"
HOMEPAGE="http://arxiv.org/abs/1210.0530"
LICENSE="CC-BY-3.0" # I think
SLOT="0"
KEYWORDS="amd64 x86"
DEPEND="dev-texlive/texlive-latexextra"
# texlive-latexextra has authblk under preprint/ [1].
# [1]: https://www.ctan.org/pkg/authblk is in TeX Live

# The tarball unpacks straight into the work directory.
S="${WORKDIR}"

src_compile() {
    # Run pdflatex twice so that cross-references resolve.
    pdflatex best-practices-scientific-computing-2012.tex &&
    pdflatex best-practices-scientific-computing-2012.tex ||
    die "Failed to run pdflatex"
}

src_install() {
    # Install the compiled PDF as package documentation.
    dodoc best-practices-scientific-computing-2012.pdf
}

I had to download the paper's source by hand because arXiv blocks the
vanilla wget [10], but with the paper source hosted in a Git
repository that wouldn't be an issue. Anyhow, that doesn't seem
too complicated to me. I'd guess a negligible fraction of folks know
the ebuild syntax out of the gate, but that's probably true of any
existing package manager, and this is short enough that folks should
be able to easily match and replace in templates. I'd expect you
could write similarly concise packages for this paper using pacman,
Homebrew, or any other mature, source-based packaging system.
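
To make the "match and replace in templates" point concrete, here is a minimal Python sketch that fills in an ebuild-like template. The `EbuildTemplate` class, the `@{...}` placeholder syntax, and the field names are assumptions made for this illustration and are not part of Portage. The `@` delimiter is chosen so that shell variables such as `${P}` pass through unchanged.

```python
from string import Template


class EbuildTemplate(Template):
    """Template that uses '@' for placeholders, leaving shell '$' alone."""
    delimiter = "@"


# Hypothetical template modelled on the ebuild header above.
EBUILD = EbuildTemplate('''\
EAPI="5"
SRC_URI="@{src_uri} -> ${P}.tar.gz"
DESCRIPTION="@{description}"
HOMEPAGE="@{homepage}"
LICENSE="@{license}"
SLOT="0"
KEYWORDS="amd64 x86"
DEPEND="@{depend}"
''')


def render_ebuild(**fields):
    """Fill in the template; raises KeyError if a field is missing."""
    return EBUILD.substitute(**fields)


if __name__ == "__main__":
    print(render_ebuild(
        src_uri="http://arxiv.org/e-print/1210.0530v4",
        description="Best Practices for Scientific Computing preprint",
        homepage="http://arxiv.org/abs/1210.0530",
        license="CC-BY-3.0",
        depend="dev-texlive/texlive-latexextra",
    ))
```

The sketch only covers the metadata header; the `src_compile` and `src_install` functions would still be written by hand or kept as fixed text in the template.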

@gvwilson
Contributor Author

On 2015-03-22 2:07 PM, Donal Heidenblad wrote:

This is an interesting idea. Am I correct in interpreting this as much
of an exploratory project as it is an attempt to create a working
prototype? On the surface it seems trivial, but there seem to be a
lot of interesting challenges lurking in the shadows.

There's certainly a lot of exploration involved, but there has to be a
working system at the end of the day. The prototype at
git@github.com:swcarpentry/installable-lesson-demo-01, for example, has
to make decisions about where to store lessons when they've been
installed, where to put things like exercise files for learners to work
on, how to re-set when the learner wants to start an exercise over, how
to report dependencies between lessons (my tutorial on how a call stack
works could depend on anyone's introduction to writing functions -
that's different from how most package managers track dependencies), etc.
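
One way to picture the dependency point above (a lesson needs a topic, and any of several lessons could supply it) is the following Python sketch. The `Lesson` class, the `provides`/`requires` fields, and the lesson names are all hypothetical; the prototype may model this quite differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lesson:
    name: str
    provides: frozenset = field(default_factory=frozenset)  # topics taught
    requires: frozenset = field(default_factory=frozenset)  # topics assumed


def candidates_for(topic, catalog):
    """Return the names of every lesson in the catalog that teaches `topic`."""
    return sorted(lesson.name for lesson in catalog if topic in lesson.provides)


def unmet_topics(lesson, installed):
    """Return topics `lesson` requires that no installed lesson provides."""
    taught = set()
    for other in installed:
        taught |= other.provides
    return sorted(lesson.requires - taught)


if __name__ == "__main__":
    # Invented catalog: two authors each offer an introduction to functions.
    alice = Lesson("alice-functions", provides=frozenset({"functions"}))
    bob = Lesson("bob-functions", provides=frozenset({"functions"}))
    stack = Lesson(
        "call-stack",
        provides=frozenset({"call-stack"}),
        requires=frozenset({"functions"}),
    )
    print(candidates_for("functions", [alice, bob, stack]))
    print(unmet_topics(stack, installed=[bob]))
```

Matching on topics rather than on package names is what lets either introduction satisfy the call-stack lesson; a conventional name-based dependency would pin one specific lesson.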

Are there potential challenges with licensing of package contents with
this approach? I saw in the yaml for the conda packages it lists the
license, but code, text, and data may potentially be licensed differently.

Yup - a working system will have to handle that without making the
simple case complicated.
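
A possible shape for this, sketched in Python under stated assumptions: the package metadata carries either one license string (the simple case) or a per-kind mapping. The `normalize_license` helper, the three kinds, and the optional `default` key are invented for illustration, and the license identifiers in the usage lines are just examples.

```python
KINDS = ("code", "text", "data")


def normalize_license(spec):
    """Expand a license spec into a {kind: license} mapping.

    `spec` is either a single license string, which applies to everything,
    or a dict giving a license per kind with an optional "default" entry.
    """
    if isinstance(spec, str):
        return {kind: spec for kind in KINDS}
    default = spec.get("default")
    result = {}
    for kind in KINDS:
        license_id = spec.get(kind, default)
        if license_id is None:
            raise ValueError(f"no license given for {kind!r}")
        result[kind] = license_id
    return result


if __name__ == "__main__":
    # Simple case: one license for the whole package.
    print(normalize_license("CC-BY-4.0"))
    # Mixed case: code is licensed differently from text and data.
    print(normalize_license({"default": "CC-BY-4.0", "code": "MIT"}))
```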

@wking
Member

wking commented Mar 23, 2015

Hmm, is this about packaging lessons or packaging papers? The title
for this issue [1] and the idea sound like they're focused on lessons,
but the text for the idea sounds like it's talking about papers [2]
(which is why I used a paper for my ebuild example).

@gvwilson
Contributor Author

gvwilson commented Mar 23, 2015 via email

@wking
Member

wking commented Mar 23, 2015

On Mon, Mar 23, 2015 at 12:52:20PM -0700, Greg Wilson wrote:

I think (well, I hope) that a system capable of handling one will be
able to handle the other. The use cases are "I want to review your
paper" and "I want to combine your paper with two others to do new
research" - the first is more-or-less standalone, unless your paper
depends on your previous work, but the second is closer to the
lesson installation example.

For “I want to rebuild your paper” or “rerun your simulation” (or
basically “re… ” anything), I think existing package managers already
have you covered (for example, see my “rebuild your paper” ebuild
above). If the goal of this project is just to polish that UX up for
folks new to package management, that's fine (but I think we want to
clarify what sort of polishing we think needs to happen).

For “I want to combine your paper with two others”, that's going to be
reasonably easy (but hard to completely automate) if both papers use
the same tools (e.g. LaTeX with similar packages, or Markdown with
similar extensions, or …). But combining a paper written in markdown
with another written in LaTeX is going to be really hard to
automate. And what does “combining papers” mean? I think web-enabled
research is going to become mature when folks maintain their papers
like developers maintain libraries. E.g. there's a team managing the
“protein force spectroscopy” library, and folks doing new research
publish by sending PRs to that team, so there's one location (or a few
parallel locations in the event of a fork/competitor) that always
documents the state of the art for that particular niche. But writing
a new paper that cleanly synthesizes two existing papers is probably
not something that a machine should be doing (and certainly not
something that a package manager should be doing).

Recording prerequisites as dependencies (what the user should know
before reading this) seems completely separate from build dependencies
(what the computer should know before running this). Are you looking
for a package manager that automatically creates a curriculum for a
user by tracking their mental state like a package tree? “You want to
learn about call stacks, and I remember that you already completed a
lesson on variables, so I suggest you read lesson-123 on functions,
and then lesson-456 on call stacks”.
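
The "curriculum from a package tree" idea can be sketched in a few lines of Python. This assumes lessons declare their prerequisite lessons directly; the `PREREQUISITES` table, the `suggest_path` function, and the `lesson-001` identifier for the variables lesson are hypothetical (only `lesson-123` and `lesson-456` appear in the example above). It uses the standard-library `graphlib` module, available from Python 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite table: lesson -> lessons to finish first.
PREREQUISITES = {
    "lesson-001": set(),           # variables
    "lesson-123": {"lesson-001"},  # functions
    "lesson-456": {"lesson-123"},  # call stacks
}


def suggest_path(goal, completed, prerequisites=PREREQUISITES):
    """List the lessons still needed to reach `goal`, in study order."""
    completed = set(completed)
    needed = {}
    stack = [goal]
    while stack:
        lesson = stack.pop()
        if lesson in completed or lesson in needed:
            continue
        # Only prerequisites the learner has not finished stay in the graph.
        remaining = set(prerequisites.get(lesson, ())) - completed
        needed[lesson] = remaining
        stack.extend(remaining)
    # static_order() yields each lesson after all of its prerequisites.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(suggest_path("lesson-456", completed={"lesson-001"}))
```

The learner's "mental state" here is just the set of completed lesson names, which is a much cruder model than a real package database.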

Or are we comfortable listing prerequisites in the text and letting
the student be their own prerequisite package manager? “Ah, this
call-stack lesson says I should understand about functions. I'm not
very familiar with them, so I'll click on this link to one of their
suggested function lessons…”

You also point out issues with lesson interactivity (“how to re-set
when the learner wants to start an exercise over” [1]), but that
sounds like something the lesson can work out internally, and not
something that a package manager needs to get involved in. Although I
expect most lessons will want to use a traditional package manager so
they can share that interactivity logic via an external library/tool.
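
As a rough illustration of how a lesson (or a shared helper library) might handle the reset itself, here is a Python sketch. It assumes each exercise has a pristine installed copy and a separate working copy; the `reset_exercise` function and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def reset_exercise(pristine_dir, work_dir):
    """Replace the learner's working copy with a fresh copy of the exercise."""
    pristine_dir, work_dir = Path(pristine_dir), Path(work_dir)
    if not pristine_dir.is_dir():
        raise FileNotFoundError(f"no pristine exercise at {pristine_dir}")
    if work_dir.exists():
        shutil.rmtree(work_dir)  # discards the learner's edits
    shutil.copytree(pristine_dir, work_dir)
    return work_dir
```

Nothing in this needs the package manager beyond knowing where the pristine copy was installed.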

@donalus

donalus commented Mar 26, 2015

I am concerned with the suggestion that ebuild or any of the software make systems are "easy enough". Adoption is a huge issue and I think the plan nails it with: "creating a package manifest for each paper would require more effort than most scientists would be willing to put in". I think that is the primary hurdle, but maybe by focusing on lessons we could skirt the issue and focus on the technical challenges first. Lesson creators can be expected to suffer more hurdles than a scientist without a funding agency mandate. I guess the hope would be that the people packaging lessons would also try to package their papers, which could then identify friction points that would need to be addressed. So in the short term, we'd accept that most scientists would think we are crazy, but with the goal of making it easier over time.

Once we start talking about managing lesson pre-requisites it starts to sound like we are talking about enabling lessons to be used outside of the workshops in more of an online learning context. That might be desirable, but there are mature tools available that could be used rather than extending a package manager. Packaging lessons complements both workshops and online learning, though, so it isn't exclusionary, just a question of identifying the right tools and defining the right scope.

@wking
Member

wking commented Mar 26, 2015

On Thu, Mar 26, 2015 at 12:55:38PM -0700, Donal Heidenblad wrote:

I am concerned with the suggestion that ebuild or any of the
software make systems are "easy enough".

I agree, but I'm trying to put my finger on why these approaches have
not been adopted already. My current guesses are:

a. There's no incentive to package your paper/lesson for reuse. Even
if it were really easy, folks are busy, and this is something you
can skip.
b. Folks are less likely to be running systems like Gentoo/Arch/…
where source-based package managers (which have, I think, the
lowest bar for creating a new package) are native. The learning cost
of switching to such a system is too high for the small gain you'd get
from easily packaging your papers/lessons.

You can work around (b) by using a guest package manager (HashDist,
pip, Homebrew, npm, …), but it's hard to integrate a guest package
manager with your system libraries, and you'll end up maintaining a
quasi VM just for packaging papers/lessons. Again, folks see this as
more trouble than it's worth.

Adoption is a huge issue and I think the plan nails it with:
"creating a package manifest for each paper would require more
effort than most scientists would be willing to put in".

I don't think that this is the problem. Creating the manifest is easy
(as I showed in my ebuild above), but installing the system that
consumes the manifest (in this case, Portage) is hard. I don't think
the idea (as it's currently written) is aiming at making it easier to
get folks onto Gentoo/Arch/… with native source-based package
managers, nor is it aiming at making the interface between host and
guest package managers more robust.

So in the short term, we'd accept that most scientists would think
we are crazy, but with the goal of making it easier over time.

Yeah, this sounds right to me ;).

Once we start talking about managing lesson pre-requisites it starts
to sound like we are talking about enabling lessons to be used
outside of the workshops in more of an online learning context. That
might be desirable, but there are mature tools available that could
be used rather than extending a package manager.

Such as… ?

@rgaiacs
Contributor

rgaiacs commented Mar 26, 2015

@donalus I don't know why @gvwilson forgot to include links to @khinsen's blog

The biggest problem with this idea is that we are being too greedy. What I can suggest given the time that you have to write a proposal is:

  • Use conda as a starting point because it can handle Python and R (I haven't tested this).
  • Design a format to record the Python and R dependencies.
  • Implement a "package management for papers" based on conda that will read the dependency file, download the paper and the software it uses (e.g. numpy, matplotlib, ...), and compile the paper. If the paper uses LaTeX we will leave it as .tex and if the paper uses Pandoc we will leave it as .md.
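
The three steps above could look roughly like the following Python sketch. The manifest layout, its field names, the example URL, and the `build_plan` function are all assumptions made for illustration, since no format has been designed yet. The package names are assumed to be conda package names, and the sketch only builds a plan without running anything. It stops at keeping the `.tex` or `.md` source, because the compile step is not specified above.

```python
import json

# Hypothetical manifest; every field name here is an assumption.
EXAMPLE_MANIFEST = """
{
  "paper": {"url": "https://example.org/paper.tar.gz", "source": "paper.tex"},
  "dependencies": {
    "python": ["numpy", "matplotlib"],
    "r": []
  }
}
"""


def build_plan(manifest_text):
    """Turn a manifest into an ordered list of steps (nothing is executed)."""
    manifest = json.loads(manifest_text)
    paper = manifest["paper"]
    source = paper["source"]
    # Only LaTeX and Pandoc (Markdown) sources are in scope.
    if not source.endswith((".tex", ".md")):
        raise ValueError(f"unsupported paper source: {source}")
    plan = [("download", paper["url"])]
    deps = manifest.get("dependencies", {})
    packages = list(deps.get("python", [])) + list(deps.get("r", []))
    if packages:
        # One conda call for all packages, confirmed non-interactively.
        plan.append(("run", ["conda", "install", "--yes", *packages]))
    plan.append(("keep-source", source))
    return plan


if __name__ == "__main__":
    for step in build_plan(EXAMPLE_MANIFEST):
        print(step)
```

Returning a plan instead of running commands keeps the sketch easy to inspect; a real tool would also have to pin versions and choose conda channels, which this ignores.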

@donalus

donalus commented Mar 26, 2015

Thank you, @r-gaia-cs, that is really helpful info. I agree that we are being too greedy. My product manager brain didn't want to sign up for anything that I couldn't accomplish within the timeframe of GSOC. This scope seems reasonable and a good platform to build toward the rich feature set that I think we all want.

I just talked to Carole Goble in the hallway after she gave a great plenary talk on reproducibility in research at the conference I am attending. Her slides heavily featured the SC/DC logos. It made me start thinking about this more heavily and want to tackle this challenge, so I will do my best to get something in for tomorrow's deadline.

@rgaiacs
Contributor

rgaiacs commented Mar 26, 2015 via email

@khinsen

khinsen commented Mar 27, 2015

@r-gaia-cs Thanks for adding the references. There's also my recent article on ActivePapers (http://dx.doi.org/10.12688/f1000research.5773.2), which discusses the relation between software deployment, software preservation, and reproducible research.

Package managers are primarily about software deployment, and the aim of this project is to extend them to reproducible research, which makes some excursion into the territory of software preservation inevitable. I see two really difficult aspects in this: (1) software preservation in such a system can be no better than what the underlying platform guarantees. Python + conda is neither great nor very bad in this respect, but as soon as you integrate C code, problems will certainly appear. (2) the user interface, or, put differently, working towards acceptance by scientists. Package managers don't have a great reputation for usability. Again, conda is one of the nicer ones, so it's worth looking into.

@rgaiacs
Contributor

rgaiacs commented Mar 27, 2015

@khinsen Thanks for the link to your recent article and for making it available under CC0, ✨. Maybe with it I will understand ActivePapers once and for all. =)

@donalus There are less than 7 hours left, so if you want to work on that:

  1. Submit a very small draft of your proposal at https://www.google-melange.com/gsoc/homepage/google/gsoc2015 as soon as possible.
  2. If you need to chat, you can reach me on IRC (#swcarpentry on Freenode).

@rgaiacs rgaiacs added the 2015 label Mar 28, 2015
@rgaiacs
Contributor

rgaiacs commented Mar 28, 2015

I'm closing this issue since the student application period is over.

@rgaiacs rgaiacs closed this as completed Mar 28, 2015