Implementation details #13
I'll preface this with two things: first, this was my first foray into RL, and second, the paper was woefully inadequate in implementation details.
Yes, I modeled the entire thing as episodic-memory-style training, because that's what I was familiar with (my knowledge is mainly from DeepMind's papers on AlphaGo, Atari DQNs, etc., so I transferred what I knew to this). Of course, the original authors might have used something more advanced (I am pretty sure they did), and therefore their implementation would be vastly superior to mine.
Thanks for the quick response! And again, I understand about the serious lack of details in the paper! My background is more on the RL side, but I'm new to optimizing NN architectures. I agree with your points on (1) and (2).

Regarding (1): yes, in RL you typically take the mode during evaluation but sample during training. The policy gradient theorem itself depends on actions being sampled from a stochastic policy. However, for batch size = 1, I see how taking the mode might stabilize training.

Regarding (3): the state is chained when selecting the different architecture components. If you view one architecture as a single RL episode, there are multiple actions in that episode, corresponding to the multiple building blocks of the architecture. Within an episode, the policy depends on the previous actions. However, in your implementation, the start of one architecture is also chained with the end of the previous architecture. I don't think the paper suggests this part, and I don't see the intuition. The RNN controller defines a distribution over architectures; you should be able to take i.i.d. samples from this distribution. The authors' ENAS implementation (which has more to it, but the basics are the same as NAS) uses the variable

Again, it seems like there could be a difference between viewing the entire procedure as one long episode (with each action corresponding to an architecture) and viewing each architecture as its own episode (with each action corresponding to a building block of that architecture). This also explains why (4) didn't make sense to me. I don't see any discounting in the ENAS code, but I haven't looked at everything.
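To illustrate the "i.i.d. samples from the controller's distribution" idea, here is a minimal sketch of per-architecture sampling. This is not this repo's or ENAS's actual code; the layer names, `vocab_sizes`, and sizes are assumptions made up for illustration:

```python
# Minimal sketch: one architecture = one episode. The RNN state is reset to
# zeros for every architecture, so repeated samples are i.i.d. draws from the
# distribution the controller defines, while actions *within* an architecture
# are still chained through the recurrent state.
import tensorflow as tf

units = 32
vocab_sizes = [4, 6, 4, 6]                     # hypothetical choices per building block
cell = tf.keras.layers.LSTMCell(units)
heads = [tf.keras.layers.Dense(v) for v in vocab_sizes]
embed = tf.keras.layers.Dense(units)           # embeds the previously chosen action

def sample_architecture():
    state = [tf.zeros([1, units]), tf.zeros([1, units])]   # fresh state per architecture
    inputs = tf.zeros([1, units])                          # "start" token
    actions, log_probs = [], []
    for head, vocab in zip(heads, vocab_sizes):
        out, state = cell(inputs, state)
        logits = head(out)                                  # [1, vocab]
        action = int(tf.random.categorical(logits, 1)[0, 0])  # sample, not argmax
        log_probs.append(tf.nn.log_softmax(logits)[0, action])
        actions.append(action)
        # chain within the episode; pad one-hot to a common width for the shared embed layer
        inputs = embed(tf.one_hot([action], max(vocab_sizes)))
    return actions, tf.add_n(log_probs)

# Repeated calls are independent draws from the controller's current distribution.
print(sample_architecture()[0])
print(sample_architecture()[0])
```

Because every call starts from the same zero state but samples each action, successive draws differ without any chaining across architectures.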
Hmm, the comments on (3) and (4) are quite interesting. In this code, each architecture starts with a zero state, and I don't sample the first action; so if I used the zero state each time, I would get the same architecture on every "sample". It makes sense to sample the first cell and then chain the remainder, as ENAS does. This stems from my decision not to use sampling during training, to reduce complexity. Because of that, at test time I was forced to chain multiple architectures together; otherwise, starting from a zero state vector each time would lead to the same architecture being sampled repeatedly. I have since moved on to Progressive NAS, since its SMBO training methodology is something I understand a bit better.
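A tiny illustration of the determinism point: with fixed logits (as you would get from a fixed zero start state without sampling), the argmax returns the same choice every time, while sampling from the same distribution does not. The numbers below are made up:

```python
# Illustration only: argmax of fixed logits is deterministic; sampling is not.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.2, 0.3, -0.5, 0.8])           # hypothetical fixed logits
probs = np.exp(logits) / np.exp(logits).sum()

argmax_draws = [int(np.argmax(logits)) for _ in range(5)]
sampled_draws = [int(rng.choice(len(probs), p=probs)) for _ in range(5)]

print(argmax_draws)    # always the same choice, e.g. [0, 0, 0, 0, 0]
print(sampled_draws)   # varies across draws
```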
Hi, thank you for releasing this implementation! I have several questions regarding how this works relative to my understanding of the original paper:
(1) Why is the action always the mode (i.e., the argmax of the logits) of the policy at each iteration, rather than a sample from the policy? The idea behind policy gradients is that you sample an action from a policy. Is this because the batch size is 1, so rather than sampling you just take the mode?
(2) You also take a random/exploratory action with probability epsilon. However, the logprob (and thus the loss) of exploratory actions seems to be computed incorrectly. Your loss uses `self.policy_classifiers` to compute the logprob; however, these tensors are conditioned on the argmax of the policy, not on the action that was actually taken. The loss should always use the logprob of the action taken (in this case the exploratory one), not the argmax of the policy (see the sketch at the end of this post).

(3) Each time you generate a list of actions (i.e., a single architecture), you start by feeding in the state from the previous architecture (L78 of train.py). Is this correct, and if so, what is the reason for this? Each iteration the controller has new parameters and will thus emit different architectures. Feeding the controller the previous state as
(4) I don't understand the discounting, or rather, why discounting is performed at all. A single architecture gives you a single reward, in a sort of one-step episode process.
Based on (3) and (4), it seems like you view the entire process as a single RL "episode," as opposed to each architecture comprising its own episode. Is this correct?
I understand the original paper is very unclear on a lot of these points, so I appreciate any feedback or discussion!
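To make points (1), (2), and (4) concrete, here is a minimal single-step REINFORCE update. This is a hedged sketch, not this repository's code; `sample_architecture`, `get_reward`, and `controller_vars` are hypothetical stand-ins (e.g., the sampling function sketched earlier, a child-network evaluation, and the controller's trainable variables):

```python
# Sketch of one architecture-episode update. The episode has a single reward,
# so the (discounted) return is just that reward and gamma never enters.
# The loss uses the log-probability of the actions actually taken -- if an
# epsilon-greedy exploratory action is substituted, its log-prob is what
# should enter the loss, not the log-prob of the argmax action.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)
baseline = tf.Variable(0.0, trainable=False)        # moving-average baseline

def reinforce_step(controller_vars, sample_architecture, get_reward):
    with tf.GradientTape() as tape:
        actions, log_prob = sample_architecture()    # log-prob of the taken actions
        reward = get_reward(actions)                 # e.g. child network's validation accuracy
        advantage = tf.stop_gradient(reward - baseline)
        loss = -log_prob * advantage                 # plain REINFORCE, no discounting
    grads = tape.gradient(loss, controller_vars)
    optimizer.apply_gradients(zip(grads, controller_vars))
    baseline.assign(0.95 * baseline + 0.05 * reward)
    return reward
```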