Deep Reinforcement Learning for Robotics - Part 2

Welcome to the Robot Remix, where we summarise the week's need-to-know robotics and automation news.

In today's email -  

  • Killer robots - fact or fiction 🙄
  • More self-driving car troubles
  • Automation is the key to reshoring
  • $250k for AI start-ups
  • Part 2 of our series on DRL
  • Bonk testing and Burning Man


AI robots killed 29 Japanese scientists - Or so goes the conspiracy theory. Speaking at Conscious Life Expo in Los Angeles in 2017, journalist (conspiracy theorist) Linda Moulton Howe told how, “at a top robotics company in Japan this week, four robots being developed for military applications killed 29 humans in the lab.” The video of Howe’s speech has resurfaced and is making the rounds on Twitter. We previously reviewed why robots elicit fear, but this story has been debunked a few times, so please don't fall for it. (News)

Beckhoff introduces new robot - ATRO is a modular industrial robot system that can be used to build a very flexible range of systems from 6-axis arms to gantries. One cool thing about its design is that fluid, power, and data are all routed internally allowing endless joint rotation. (News)

Deep tech moonshots are worth a shot - Contrary to the standard VC investment philosophy of focusing on software companies, Cantos Ventures targets hardware and biotech investments. In this blog post, the VC explains why deep tech is actually a better financial bet than software. Pay attention, investors! (Opinion)

China starting to invest Hard and Smart - On that note, China is following this trend, with VCs starting to focus on hardware. (News)

Robot team, unite - Researchers at the University of Illinois have developed a method to train multiple agents to work together using multi-agent reinforcement learning. Teams are able to coordinate and work together harmoniously without communication, but purely through AI. We’re hoping that Robot Wars will soon have a category for team competitions. (Research)

Cruising for a bruising - One day after becoming the only company authorised to operate robotaxis in San Francisco, a Cruise automated vehicle was involved in a collision. A software defect caused the vehicle to incorrectly predict another vehicle's path and become insufficiently reactive to the sudden path change of a road user. As a result, the manufacturer was forced to recall the vehicles. (News)

Don’t jump in front of a moving car - Last week we discussed Tesla’s cease and desist over videos showing their car mowing down model children. Check out this video from the IET investigating the true safety of self-driving cars. Yes, the presenter jumped in front of a moving car. (Video)

AI Grant Program now live - Set up in 2017, AI Grant is looking to invest $250,000 in entrepreneurs leveraging AI technologies. This is from the same company we featured last week due to its impressive approach to open-source AI. Don't spend it all at once. (News)

Automation key for effective re-shoring - “With the power of automation, our workers can win. Without it, they're in trouble.” The popular Noahpinion blog lays out an argument for why the US must invest in automation in order to achieve hopes of re-shoring production and manufacturing from China. It’s well researched - give it a read! (Weekend Read)

The Big Idea

Deep Reinforcement Learning - Part 2

This is part two of three in our Deep Reinforcement Learning Primer.  If you need a reminder on the difference between Supervised, Unsupervised, Reinforcement and Deep learning – check out last week's article here.

As we discussed, one approach has been taking the robotics industry by storm  – Deep Reinforcement Learning. It is required knowledge for anyone hoping to keep up with the robotics industry.

This week we’ll look at the topic in more detail. Our goal is to provide intuition behind the core ideas rather than an exhaustive theory lesson. For that, we recommend OpenAI’s documentation.

What is Reinforcement Learning?

Reinforcement Learning (RL) is the task of learning through trial and error. The algorithm is not told what actions to take, only what its goal is. It must discover for itself which actions produce the greatest reward. Reinforcement learning does not require labelled data like Supervised Learning; it doesn’t even use an unlabelled dataset, like Unsupervised Learning.

Rather than seeking to discover a relationship in a dataset, Reinforcement Learning creates its own dataset by monitoring the environment and the impact of its actions.

Why does it apply to robotics?

The majority of industrial robot applications are controlled very tightly with hard coding and predefined logic. This works well for simple applications where the designer has perfect knowledge of the environment. As soon as randomness and uncertainty get introduced or the task becomes too complex, it becomes unfeasible for the designer to create a program capable of dealing with all situations without error.

Programs are built on top-down theory, while RL is built on bottom-up empiricism. Top-down is fragile - a slight perturbation and the whole system breaks down. Bottom-up evolution is much more stable and can adapt to changes in the real world. It is one of the reasons nature often outperforms human design.

How does it work?

RL Agents interact with the world through a closed feedback loop. We call RL algorithms Agents because they act with agency or at least something akin to it. The Agent receives two inputs -

  • An observation of the environment’s current situation or State
  • A Reward from the Agent’s previous action

The Agent weighs these inputs against its internal ‘Policy’ and chooses its next action. The Policy is the strategy the Agent has developed for achieving its goal, based on its hard-coded motivation and its learned experience. How hard this challenge is depends on a few factors -

  • Observability - In games like Atari or chess, the Agent can have perfect knowledge of the environment's state. In the real world, a robot only observes the sliver of the environment that can be captured through its imperfect sensors.
  • Discrete or Continuous - In a game like chess there are only a finite number of actions an Agent can take, whereas a self-driving car on the open highway has near infinite options.
  • Model-Based or Model-Free - The Agent may be given, or may learn, a model of the environment that predicts how states change and which Rewards follow from different actions and states. This is known as Model-Based. In Model-Free approaches, the Agent has no such model and only has the Reward it receives to determine if an action is good or bad.
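The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real robotics setup: the `Corridor` environment, its reward values and the `random_policy` placeholder are all hypothetical, chosen only to show the State, Reward and Policy passing around the loop.

```python
import random


class Corridor:
    """Hypothetical toy environment: a 1-D corridor with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the State the Agent observes

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        # Assumed reward scheme: small penalty per step, bonus at the goal
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def random_policy(state):
    """Placeholder Policy: ignores the State and picks an action at random."""
    return random.choice([-1, 1])


def run_episode(env, policy, max_steps=100):
    """Run one pass of the Agent-environment loop and return the summed Reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the Agent consults its Policy
        state, reward, done = env.step(action)  # the environment answers with State and Reward
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(Corridor(), random_policy))
```

A learning Agent would replace `random_policy` with something that improves from the Rewards it receives. Here the State is fully observable and the actions are discrete, which is the easy end of the factors listed above.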

Rewarding our robot

The central idea of RL is the Reward Hypothesis -

All goals can be described as the maximization of the expected cumulative reward, or expected return

It’s pretty simple - the goal of an Agent is to take actions that maximize the expected return. Its “motivation” is to understand and predict both the direct reward of an action and all of the future rewards that can be expected to follow. Future rewards are discounted because they are less certain, but they still need to be taken into account.
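To make discounting concrete, here is a small sketch of how a discounted return could be computed from a list of rewards. The discount factor `gamma` and its default of 0.9 are assumptions for illustration; the name is the conventional one but does not appear in this article.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, scaling each by gamma ** (steps into the future).

    gamma is an assumed discount factor between 0 and 1: the closer to 0,
    the less the Agent cares about rewards that arrive later.
    """
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With `gamma=1.0` the function reduces to a plain sum, and with `gamma=0.0` only the immediate reward counts.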

The expected, discounted return of a specific state is known as the Value. It represents how good a state is for an Agent to be in. As in all things AI, the taxonomy can get a bit bloated – we have Reward, Return, Value and in a second we’ll also have Quality…

A human is required to define an Agent’s rewards and this is where RL moves from science to art. Attributing rewards can be easy for a game like chess with clear, easily quantifiable win and lose states but is much harder for real-world challenges that robots often face. Again, robots don’t have full access to their environment and need to use proxies.

As an example - The Dreamer algorithm (above) uses an Actor-Critic DRL approach to teach a quadruped robot to walk from scratch in under an hour. The Agent was rewarded for maintaining upright joint angles and forward velocity. Although the algorithm's results are very impressive, the video suggests it has settled on a local optimum. The robot is successfully walking forward and is upright, but is using a middle joint rather than its feet. As a result, the researchers might need to find another reward signal to ensure smooth walking - alignment is hard.

What's our policy?

The Policy is the brain of our Agent - it defines its behaviour and tells the Agent which action to take given the state it’s in. In RL the optimal Policy is not known and must be learnt through training. This can be achieved -

  • Directly, by teaching the Agent to learn which action to take, given the state it is in - Policy-Based
  • Indirectly, by teaching the Agent to learn which states are more valuable and then take the action that leads to the more valuable states - Value-Based
  • A mix of the two - Actor-Critic

In policy-based methods, the optimal policy is found by training the policy directly and this policy is kept explicitly in memory.

In value-based methods, finding an optimal value function leads to an optimal policy. We don't store an explicit policy, only a value function. The policy is implicit and can be derived directly from the value function - "pick the action with the best value".
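As a rough illustration of "pick the action with the best value", the sketch below derives an implicit policy from a table of values. The states, actions and numbers in `q_values` are made up for the example; in practice they would be learnt during training.

```python
# Hypothetical learnt values: q_values[state][action] is how good the
# Agent currently believes that action is in that state.
q_values = {
    "low_battery": {"recharge": 5.0, "keep_working": -2.0},
    "full_battery": {"recharge": 0.0, "keep_working": 3.0},
}


def greedy_policy(state, q_values):
    """Implicit policy: look up the state and return its best-valued action."""
    actions = q_values[state]
    return max(actions, key=actions.get)


print(greedy_policy("low_battery", q_values))   # recharge
print(greedy_policy("full_battery", q_values))  # keep_working
```

Nothing here stores a policy directly. Change the numbers in the table and the behaviour changes with them, which is the sense in which the policy is implicit.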

Learning to act

A big challenge for RL is understanding when to learn and when to act. This is known as the explore vs exploit trade-off -

  • Exploration - an Agent learns by trying random actions in order to find more information about the environment.
  • Exploitation - known information is used to maximize the reward.

We need to balance how much we explore the environment and how much we exploit what we know about the environment. Explore too long and we waste time, exploit too early and we get stuck in a local optimum with sub-optimal results – see Dreamer.
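One common way to strike this balance is an epsilon-greedy rule, sketched below. It is one option among many, not the approach used by Dreamer, and the `epsilon` parameter and the example action values are assumptions for illustration.

```python
import random


def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action.

    action_values maps each action to the Agent's current estimate of its value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(action_values))        # explore: try a random action
    return max(action_values, key=action_values.get)  # exploit: use what we know


# Hypothetical value estimates for a robot choosing a gait
estimates = {"walk": 1.2, "crawl": 0.4, "hop": 0.9}
print(epsilon_greedy(estimates, epsilon=0.0))  # walk
```

A typical design choice is to start with a high `epsilon` and reduce it over training, so the Agent explores early and exploits more as its estimates improve.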

FYI - This is the same challenge found in house hunting and partner selection.

Jack Pearson