(i). Page 2, Section 1: It seems that much of the discussion in the draft is about monitoring applications (e.g., Sections 3.1, 3.2). However, Section 3.2 talks about monitoring to improve fairness in wireless networks. More detail on this would be helpful. I just read the references and I think I understand the use case. A couple of comments on the bullets on page 3:

(a). I assume we want to do model-free RL. That is, a Markov Decision Process (MDP) (S,A,P,R,\gamma) where we don't know P (the probability/state-transition matrix) or R (the reward function). This still requires that we know the state space S and which actions a \in A we can take in a state s \in S, and possibly some estimate of the reward function if it isn't inherent (an example of an inherent reward function is a video game, where the reward is your score). Most of the work then is in estimating the expected discounted reward, and there is a ton of literature/code on this; a minimal sketch is appended at the end of these comments.

(b). We might need a definition of goal states as well.

(c). In networks my guess is that we really have a Partially Observable MDP (POMDP), which introduces another set of issues.

(ii). Section 3.1: It might be useful to describe what "optimal paths" are. For example, are paths "trajectories" in the state space?

(iii). Section 3.3: It's not clear how RL would be applied to network issues such as latency, etc. A bit of discussion of that would be helpful.

(iv). As I mentioned, the emergence of 2-player games (minimax and others) in ML is really interesting. Examples include variational autoencoders, GANs, AlphaGo, and others.

(v). Section 5.2: What is the reward function? Also, as I mentioned above, one needs to know the state space S and the action space A. In addition, there is a classic tradeoff between exploitation and exploration which gates learning; that is indirectly alluded to in this section, but it might be worth explicitly explaining that tradeoff and how it is managed here (the sketch appended at the end of these comments shows the simplest epsilon-greedy form). Optimal paths are mentioned again; it might be worthwhile defining exactly what an optimal path is (for example, is it a path (trajectory) in the state space or something different?). The Distance-and-Frequency technique also isn't defined: it is "based on Euclidean distance", but between what points, and what do those points represent?

(vi). Section 5.4: This section makes it seem as if paths are trajectories in the state space. Is that correct? It might be useful to describe how the agents communicate, how the distributed RL algorithm works, what its properties are, etc. Also, what are the privacy and security implications of the distributed environment (how much information is exchanged, what is its sensitivity, how is it protected, ...)? "The agents have limited resources and incomplete knowledge of their environments." -- Does this mean that the model is a POMDP?

(vii). Cluttered-index-based scheme: This is not really described; it might be helpful to give an overview of what the Cluttered-index-based scheme is and what its properties are.

(viii). Section 6: "...shown in figure 1, where the architecture is combined with a hybrid architecture making use of both a master / slave architecture and a peer-to-peer." This is hard to understand. Which is the "architecture" and which is the "hybrid architecture"? Here it would be useful to understand the distributed RL algorithm that this architecture is supporting.

(ix). Figure 3 is really hard to understand. I can't really work out the algorithm and its properties from it. In addition:

(a). In the "Do optimized exploration..." box:
     (3). How is R calculated?
     (4). Where does the Policy come from (or is it just epsilon-greedy)?
     (6). Not sure what "update the learning model" means. What is updated?
     (7). Where does Sn come from? ((4) in the above box?)

Finally, we might consider Evolution Strategies [0,1] as a more black-box approach to RL (it doesn't require gradients, is highly parallelizable, etc.); a rough sketch of the basic update is also appended below. Code is here: https://github.com/openai/evolution-strategies-starter

[0] https://blog.openai.com/evolution-strategies/
[1] https://arxiv.org/abs/1703.03864
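For (i)(a), it may help the draft to be explicit about the quantity being estimated. The standard definitions of the discounted return and the action-value function (textbook material, not taken from the draft) are:

    G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
    \qquad
    Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]

so the draft needs to say what R and \gamma mean for the monitoring use case.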
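Also for (i)(a) and the exploration/exploitation point in (v), here is a minimal Python sketch of model-free tabular Q-learning with an epsilon-greedy policy. The environment interface (env.reset / env.step / env.actions), the state and action spaces, and the reward signal are hypothetical placeholders that the draft would have to define; this is only meant to show where S, A, R, and the exploration knob appear, not to suggest an implementation.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Model-free tabular Q-learning: estimates Q(s, a), the expected
        # discounted return, without knowing the transition matrix P or the
        # reward function R. 'env' is a hypothetical environment with
        # reset()/step()/actions() methods; nothing here is taken from the draft.
        Q = defaultdict(float)                       # Q[(state, action)] -> estimate
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                # Exploration/exploitation tradeoff: epsilon-greedy policy.
                if random.random() < epsilon:
                    action = random.choice(env.actions(state))                     # explore
                else:
                    action = max(env.actions(state), key=lambda a: Q[(state, a)])  # exploit
                next_state, reward, done = env.step(action)   # reward must be observable
                # TD update toward reward + gamma * max_a' Q(next_state, a')
                best_next = max((Q[(next_state, a)] for a in env.actions(next_state)),
                                default=0.0)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q

The epsilon parameter (often decayed over time) is the exploration/exploitation tradeoff referred to in (v); the draft should say what plays the role of env, what a state is, and where the observed reward comes from.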
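For the Evolution Strategies suggestion in (ix): a rough sketch of the basic update described in [0,1], just to show why it is black-box and parallelizable. The evaluate function (episode return of a parameter vector) is a placeholder, and this omits the antithetic sampling and rank normalization used in the reference code; see the repository linked above for the real implementation.

    import numpy as np

    def evolution_strategies(evaluate, theta, iterations=200, pop_size=50,
                             sigma=0.1, lr=0.01):
        # Black-box ES in the spirit of [0,1]: perturb the 1-D parameter vector
        # theta with Gaussian noise, score each perturbation with the
        # gradient-free return 'evaluate' (a placeholder supplied by the user),
        # and step theta along the noise directions weighted by their returns.
        for _ in range(iterations):
            noise = np.random.randn(pop_size, theta.size)     # one perturbation per worker
            returns = np.array([evaluate(theta + sigma * eps) for eps in noise])  # independent evaluations
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)         # normalize for stability
            theta = theta + (lr / (pop_size * sigma)) * (noise.T @ returns)       # estimated gradient ascent step
        return theta

Since evaluate only needs to return a scalar, the pop_size evaluations are independent and can be farmed out to separate workers, which is where the parallelism comes from.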