Q-learning reward propagation method for reducing the transmission power of sensor nodes in wireless sensor networks

Yunsick Sung, Eunyoung Ahn, Kyungeun Cho

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

In wireless sensor networks (WSNs), sensor nodes and sink nodes communicate among themselves to collect and send data. Because of the volume of data transmission, it is important to minimize the power used for this communication. Q-learning can be applied to find the optimal path between two nodes. However, Q-learning suffers from a significant learning time, meaning that the learning process must be conducted in advance for it to be applicable to WSNs. Many studies have proposed methods to decrease the learning time by reducing the state space or by updating more Q-values at each update. Reducing the size of the Q-learning state space leads to an inability to execute the optimal action, because the correct state may not be available. Other methods utilize additional information by involving a teacher, control the time flow of Q-learning by considering the number of updates for each Q-value, or use a prioritized queue in the update procedure. Such methods are not well suited to the real-world environment and complexity of WSNs. A more suitable approach involves updating the Q-values iteratively, and a combination of these updating methods may further reduce the learning time.

This paper proposes a reward propagation method (RPM), i.e., a method that integrates various updating algorithms to propagate the reward of the goal state to more Q-values, thus reducing the learning time required for Q-learning. By updating not only the Q-value of the last visited state and executed action, but also the Q-values of unvisited states and unexecuted actions, the learning time can be reduced considerably. In this method, we integrate the following three Q-value updating algorithms. First, we incorporate the concept of Q(λ)-learning, which iteratively propagates the terminal reward to the Q-values of the states visited before the terminal reward is received. If the terminal reward were also propagated to the Q-values of unvisited states, the learning time could be reduced further. Second, the concept of reward propagation is expanded. The previous one-step reward update method, once the terminal reward is received, updates only the Q-values of states from which the terminal reward can be obtained by executing a single action. If the terminal reward is propagated to states from which the reward can only be reached by executing more than one action, more Q-values can be updated. Third, we apply a type of fuzzy Q-learning with eligibility, which updates the Q-values of unexecuted actions. Although this method is not utilized directly, the concept of updating the Q-values of unexecuted actions is applied.

To investigate how much the learning time is reduced by the proposed method, we compare it with conventional Q-learning. Given that the optimal path problem of WSNs can be remodeled as a reinforcement learning problem, a hunter-prey capture game involving two agents in a grid environment is utilized. The prey agent, trained with RPM, learns to receive the terminal reward and escape from the hunter. The hunter randomly selects one of two movement policies and executes one action based on the selected policy. To measure the difference between the success rates of conventional Q-learning and RPM, an equivalent environment and identical parameters are used for both methods.
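To make the updating idea above concrete, the following minimal sketch (in Python) shows a conventional one-step Q-learning update alongside a hypothetical propagation step that, once the terminal reward is observed, seeds the Q-values of nearby states, including unvisited states and unexecuted actions, with a discounted estimate of that reward. The grid parameters, the `PROPAGATION_SCOPE` constant, and the `neighbors_within` helper are illustrative assumptions only and do not reproduce the paper's exact RPM update rules.

```python
from collections import defaultdict

# Hypothetical grid-world parameters (illustrative, not taken from the paper).
GRID_SIZE = 10
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # up, down, right, left
ALPHA, GAMMA = 0.1, 0.9                        # learning rate, discount factor
PROPAGATION_SCOPE = 2                          # how far the terminal reward is spread

Q = defaultdict(float)                         # Q[(state, action)] -> value


def q_update(state, action, reward, next_state):
    """Conventional one-step Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])


def neighbors_within(goal, scope):
    """Grid states within `scope` Manhattan steps of the goal (illustrative helper)."""
    gx, gy = goal
    for x in range(GRID_SIZE):
        for y in range(GRID_SIZE):
            d = abs(x - gx) + abs(y - gy)
            if 0 < d <= scope:
                yield (x, y), d


def propagate_terminal_reward(goal, terminal_reward):
    """Sketch of reward propagation: seed the Q-values of states near the goal,
    including unvisited states and unexecuted actions, with a discounted
    estimate of the terminal reward."""
    for state, distance in neighbors_within(goal, PROPAGATION_SCOPE):
        seeded = (GAMMA ** distance) * terminal_reward
        for action in ACTIONS:
            # Only raise values; do not overwrite better estimates already learned.
            Q[(state, action)] = max(Q[(state, action)], seeded)
```

In a full learning loop, `q_update` would run at every step as in conventional Q-learning, while `propagate_terminal_reward` would be invoked once the terminal reward is received, so that states the agent has never visited already carry useful value estimates in later episodes.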
We conduct three experiments: the first compares RPM with conventional Q-learning, the second tests the scalability of RPM, and the third evaluates the performance of RPM in a differently configured environment. The RPM results are compared with those of conventional Q-learning in terms of the success rate of receiving the terminal reward, which provides a measure of the difference in required learning times. The greatest reduction in learning time is obtained with a 10 × 10 grid, no obstacles, and 3,000 episodes to be learned; in these episodes, the success rate of RPM is 232 % higher than that of conventional Q-learning. We perform two experiments to verify the scalability of RPM by changing the size of the environment. In a 12 × 12 grid environment, RPM initially exhibits a maximum success rate 176 % higher than that of conventional Q-learning. However, as the size of the environment increases, the effect of propagating the terminal reward diminishes, and the improvement in the success rate over conventional Q-learning decreases relative to the 10 × 10 grid environment. In a 14 × 14 grid environment, the relative effect of RPM declines further, giving a maximum success rate that is around 138 % higher than that of conventional Q-learning. The scalability experiments show that increasing the size of the environment without changing the scope of terminal reward propagation causes the success rate of RPM to decrease; nevertheless, RPM still exhibits a higher success rate than conventional Q-learning. The improvement in the peak success rate, between 138 % and 232 %, can greatly reduce the learning time in difficult environments. If the computational cost of terminal reward propagation is not a critical issue, the learning time can easily be reduced further by increasing the propagation scope, although the amount of computation grows exponentially with that scope. Finally, we compare the success rates when obstacles are deployed within the 10 × 10 environment. The obstacles naturally degrade the performance of both RPM and conventional Q-learning, but the proposed method still outperforms conventional Q-learning by about 20 % to 59 %.

Our experimental results show that the peak success rate of the proposed method is consistently superior to that of conventional Q-learning. The size of the environment and the number of obstacles both affect the improvement of RPM over conventional Q-learning, because the improvement is proportional to the number of additional Q-values that are updated. As the size of the environment increases, the proportion of additionally updated Q-values relative to all Q-values decreases, and the effect of RPM decreases accordingly. Likewise, as the number of obstacles increases, fewer Q-values are updated, which also reduces the effect of RPM. Because RPM reduces the learning time, Q-learning can be applied to diverse fields, such as WSNs, in which the learning time problem has limited its application. Although many studies have been proposed to solve the learning time problem of Q-learning, the demand for reducing the learning time remains pressing. Given that reducing the learning time by updating more Q-values is an active line of research, we expect RPM to be extended further by combining additional concepts that reduce the learning time.
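As a rough illustration of how the comparison above could be tallied, the sketch below computes the success rate of receiving the terminal reward per block of episodes. The `run_episode` callables, the episode count, and the block size are placeholders and are not the paper's actual experimental code.

```python
from typing import Callable, List

def success_rates(run_episode: Callable[[int], bool],
                  episodes: int = 3000,
                  block: int = 100) -> List[float]:
    """Fraction of successful episodes (terminal reward received) per block."""
    rates: List[float] = []
    successes = 0
    for ep in range(1, episodes + 1):
        if run_episode(ep):          # True if the prey received the terminal reward
            successes += 1
        if ep % block == 0:
            rates.append(successes / block)
            successes = 0
    return rates

# Usage sketch (hypothetical episode runners for each learner):
#   rpm_curve = success_rates(run_rpm_episode)
#   q_curve   = success_rates(run_q_episode)
# Comparing the peaks of the two curves yields the kind of relative
# improvement in peak success rate reported above.
```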

Original language: English
Pages (from-to): 257-273
Number of pages: 17
Journal: Wireless Personal Communications
Volume: 73
Issue number: 2
DOIs
State: Published - Nov 2013

Keywords

  • Q-learning
  • Reward
  • Transmission Power Problem
  • Wireless Sensor Network
