This Week in Robotics 04/04

Welcome to the Robot Remix, where we provide weekly insights on robotics, autonomy and AI.

This week -

  • As little about OpenAI as possible (see memes below)
  • Google & Meta on the route to robots
  • Growth in AI jobs
  • Cognitive scientists weigh in on AI



Google has developed an auto-bagging system using ABB's YuMi. Plastic bags are ubiquitous - used at home, in shops and in factories - but they are among the hardest items for a robot to manipulate.

Why so challenging? They are thin, flexible, and can deform in many ways, leading to self-occlusions and unpredictable dynamics. They're also translucent, making perception difficult.

The solution?

  1. The bags were segmented using fluorescent paint to mark the handles, rim and middle. The bag looks normal under regular lighting, but under UV light the paint glows.
  2. The robot systematically explores the bag's "state space" by manipulating it through a set of primitive actions (open, flip, rotate). During this process, the lighting is cycled between visible and UV.
  3. The UV light and markers let the robot collect ground-truth labels without human annotation, enabling highly scalable, low-effort model training.
  4. Once trained, the algorithm divides the bagging task into three stages: (1) orienting the bag opening upward, (2) enlarging the opening, and (3) inserting the objects and lifting the bag.

This process achieved a 16/30 success rate - nowhere near good enough for industry, but it's an interesting approach.


Meta is tackling the big bottleneck in embodied robotics - data capture and training. Traditionally, training data is collected through demonstration, self-learning or simulation.

  1. Demonstration/self-learning is time-consuming and expensive.
  2. Simulation rarely captures the complexity of the real world, and models trained this way often have poor success rates.

Meta tackled challenge (1) by developing a model that learns short-time-horizon tasks like manipulation from first-person videos of humans. They compiled 4,000+ hours of human video and trained a single perception model that matches or surpasses state-of-the-art performance on 17 short-time-horizon tasks in simulation.

Why it's interesting: their approach implies we can train AIs on physical tasks by filming our day-to-day activities rather than taking the time to collect task-specific data.

Meta tackled challenge (2) and improved simulation results for long-time-horizon tasks by taking the counterintuitive approach of rejecting high-fidelity simulation in favour of an abstract sim that does not model low-level physics.

Why it's interesting: this approach bridged the gap between sim and reality, leading to a 98% success rate.


Stanford researchers have developed a low-cost, open-source teleoperation solution - ALOHA: A Low-cost Open-source HArdware System for Bimanual Teleoperation.

The system costs <$20k and provides a simple-to-use platform for teaching robots dual-handed manipulation.

How does it work? ALOHA has two leader & two follower arms and syncs the joint positions from leaders to followers at 50Hz. The user moves the leader robots, and the follower arms... follow. The whole system takes ten lines of code to implement.



Berkshire Grey, a developer of robotic intralogistics solutions, has agreed to merge with SoftBank. The deal promises to buy Berkshire Grey's outstanding shares for $375 million in cash, taking the company off the public markets.

There are some grumbles. It's an attractive exit, but investors might be disappointed as the company has traded as high as $3.72 per share in the last 12 months, well over double the deal's share price of $1.37.

It's still a good deal. Although the company is still growing, profits and cash flow remain challenging (it lost $26.9m in Q3 2023).

The mobile robot industry is still nascent, and with hundreds of providers, we can expect further consolidation and acquisitions. 6 River Systems and Fetch were recently acquired by Shopify and Zebra, respectively.

Saildrone has released the Voyager, a 33-foot uncrewed sailboat. The system sports cameras, radar and an acoustic system designed to map a body of water down to 900 feet. The company has been testing the boat in the real world since last February and is set to begin full-scale production at a rate of one boat a week.


Robert Long, a philosophy fellow, argues that cognitive science isn't a helpful inspiration for AI development. He argues that we are not good at cognitive science, and AI systems have little use for built-in human-like solutions, especially at scale.

To back this up, researchers at Harvard argue that instead of looking for inspiration, we should look for convergence:

If engineers propose an algorithm and neuroscientists find evidence for it in the brain, it is a pretty good clue that the algorithm is on the right track (at least from the perspective of building human-like intelligence).

Counter argument - maybe we should stop trying to copy human brains, which are too complex for us to decode and keep leading us down dead ends. Instead, why not focus on simpler brains where the behaviours are incredibly useful and we can decode and copy the algorithm...

Insects are the perfect place to start; with only a few million neurons, a bee can explore a new environment, find a flower and plot its way home via the quickest path, all while avoiding being swatted. This innate behaviour has been decoded and modelled in code... seems useful.

Cognitive scientists are studying LLMs to understand the emergence of intelligence with scale. GPT-3 was trained on 4 billion words; PaLM and Chinchilla were trained on around a trillion words. This is 4-5 orders of magnitude more than a human child receives.

Which factors account for that gap? Michael C. Frank, a Stanford cognitive scientist, suggests four:

  • Innate knowledge. Humans have some innate perceptual and/or conceptual foundation that we use to bootstrap concepts from our experience. There's a lot of disagreement on this.
  • Multi-modal grounding. Human language input is often grounded in one or more perceptual modalities, especially for young children. This grounding connects language to rich information for world models that can be used for broader reasoning. Once we ground AI models in perceptual input, it should be easier for them to do common-sense reasoning with less data.
  • Active, social learning. Humans learn language in interactive social situations, typically curricularized to some degree by the adults around them. After a few years, children use conversation to elicit information relevant to them.
  • Evaluation differences. We may simply be using the wrong metrics to measure success.



Jack Pearson