Architecture and training of an FEPS agent.
Published: 2025
Abstract: a) Architecture of a FEPS agent, with four sensory states (squares) and two possible actions (diamonds). The agent has two main components: the world model and the policy. The world model is composed of vertices representing observations (squares), while clone clips represent all values a belief state can take (circles). As in a clone-structured graph, each clone clip <i>b</i> relates to exactly one observation <i>s</i>, and the emission function is deterministic. The clone clips, together with the set of edges between them, form an ECM. A belief state, circled in purple, is designated by an excited clone clip. The weighted edges in the ECM encode the transition function and are trainable with reinforcement: there is one set of edges per action (light and dark turquoise arrows). The belief state in the ECM is an input to the policy, where the probability of sampling an action is a function of the EFE. In turn, the selected action determines the edge set to sample from in the world model in order to predict the next belief state and observation.

b) Training of the world model of a FEPS agent. The agent interacts with the environment by receiving observations and implementing actions. When an action <i>a</i><sub><i>t</i></sub> is chosen, a corresponding edge is sampled in the world model, from the current to the next belief state, conditioned on the action. The observation <i>s</i><sub><i>t</i> + 1</sub> associated with the next belief state is the prediction for the next sensory state. Simultaneously, the action is applied to the environment and causes a transition in the hidden states of the environment (bottom, green rectangle). The agent perceives this transition through the observation it receives at the next time step. Finally, the weights of the edges are updated. The reinforcement of an edge is proportional to the number of correct predictions it enabled in a row, as depicted by the thickness of the arrows in the world model. When the agent makes an incorrect prediction (the purple arrow), the reinforcements are applied to the edges that contributed to the trajectory. The last, incorrect edge is not reinforced.
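To make panel (a) concrete, the following is a minimal sketch of how the clone-structured world model could be represented: one trainable edge-weight matrix per action and a deterministic clone-to-observation emission map. The names (`WorldModel`, `n_clones_per_obs`, the uniform initialisation) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class WorldModel:
    """Clone-structured ECM sketch: each clone clip maps to exactly one
    observation, and there is one set of trainable edges per action."""

    def __init__(self, n_observations, n_clones_per_obs, n_actions, rng=None):
        self.rng = rng or np.random.default_rng()
        # Deterministic emission: clone clip index -> observation index.
        self.emission = np.repeat(np.arange(n_observations), n_clones_per_obs)
        n_clips = n_observations * n_clones_per_obs
        # One edge-weight matrix per action (assumed uniform initialisation).
        self.h = np.ones((n_actions, n_clips, n_clips))

    def predict(self, belief, action):
        """Sample the next belief state conditioned on the chosen action and
        return it together with the predicted observation."""
        weights = self.h[action, belief]
        probs = weights / weights.sum()
        next_belief = self.rng.choice(len(probs), p=probs)
        return next_belief, self.emission[next_belief]
```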
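The update rule in panel (b) could then look like the sketch below, which builds on the `WorldModel` above. The environment interface (`env.reset`, `env.step`), the EFE-based `policy`, the learning rate `eta`, and the helper `random_clip_for` are placeholders, and the exact proportionality of the reinforcement is an assumption: the caption only states that an edge's reinforcement grows with the number of correct predictions it enabled in a row.

```python
import numpy as np

def random_clip_for(model, obs, rng):
    # Any clone clip whose deterministic emission matches the observation can
    # serve as the belief state (an assumption of this sketch).
    return rng.choice(np.flatnonzero(model.emission == obs))

def train_episode(model, env, policy, n_steps, eta=0.1, rng=None):
    rng = rng or np.random.default_rng()
    obs = env.reset()                              # placeholder environment API
    belief = random_clip_for(model, obs, rng)
    run = []                                       # edges used since the last wrong prediction
    for _ in range(n_steps):
        action = policy(belief)                    # sample from the EFE-based policy (placeholder)
        next_belief, predicted_obs = model.predict(belief, action)
        obs = env.step(action)                     # actual observation from the environment
        run.append((action, belief, next_belief))
        if predicted_obs == obs:
            belief = next_belief                   # prediction confirmed; extend the run
        else:
            run.pop()                              # the last, incorrect edge is not reinforced
            # Each remaining edge is reinforced in proportion to the number of
            # correct predictions it enabled in a row (earlier edges enabled more).
            for rank, (a, b, b_next) in enumerate(run):
                model.h[a, b, b_next] += eta * (len(run) - rank)
            run = []
            belief = random_clip_for(model, obs, rng)  # re-anchor on the actual observation
    return model
```

This is a single-episode sketch; a run of correct predictions still open when the episode ends is simply discarded here, which is a simplification rather than something stated in the caption.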