Sony developers share how machine learning can improve QA
During a talk given at the recent CEDEC event in Yokohama, Japan, development leads within Sony discussed their recent efforts to implement AI and machine learning models to improve efficiency and accuracy within the QA process.
The talk was led by machine learning researchers from the company’s Game Service R&D department, Hiroyuki Yabe and Yutaro Miyauchi, alongside Hiroki Nakahara, a software engineer focused on QA engineering. It aimed to prime fellow creators on how the company had integrated AI into the QA process using real PS5 hardware, collecting only on-screen and audio information, just as human-driven QA does, while allowing titles to be tested more regularly and with greater efficiency.
Conducting this testing autonomously allowed teams to eliminate more bugs earlier, as manual testing can otherwise only be conducted a few times per development cycle, and a bug caught too late in development has a chance of impacting the release.
For this talk, the team shared their findings using the software to automate QA operations in PlayStation 5 launch title Astro’s Playroom. This was notable as one key feature requiring extensive QA testing was the integration of game progress with hardware functionality such as the PS5’s Activity Cards, which could track progress on particular objectives as players made their way through a level.
Replay Agent and Imitation Agent
When researching how to integrate the technology into the testing process, the team had a few conditions that needed to be met: any testing system must not rely on game-specific tools that would then need to be remade for use in other games – in other words, AI testing for a shooting game mustn’t rely on aim assistance that can’t be applied to a platformer or another shooter, and so on.
The system also had to be achievable at a realistic cost that made such automation worthwhile, and simple enough that even those without technical experience could create an Imitation Agent and run the testing simulation.
In the case of Astro’s Playroom, this resulted in QA being automated through two separate automated play systems: a Replay Agent and an Imitation Agent. The former replicated exact button inputs to ensure consistency, and was used in select circumstances such as navigating in-game UI and the PS5 hardware menus, or moments such as moving from a spawn point to a level transition where no variables can impact movement.
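The talk did not go into implementation detail, but the core idea of the Replay Agent can be sketched as a simple record-and-replay loop. In the sketch below, `read_controller` and `send_controller_state` are hypothetical callbacks standing in for the PS5-to-PC input link, and the 10Hz tick rate is an assumption borrowed from the imitation model described later in the talk.

```python
import time

def record_session(read_controller, duration_s: float, hz: int = 10) -> list[dict]:
    """Capture the controller state at a fixed rate so it can be replayed exactly."""
    frames = []
    for _ in range(int(duration_s * hz)):
        frames.append(read_controller())  # e.g. {"buttons": ["cross"], "stick": (0.0, 1.0)}
        time.sleep(1 / hz)
    return frames

def replay_session(frames: list[dict], send_controller_state, hz: int = 10) -> None:
    """Reproduce the recorded inputs frame by frame for deterministic sections
    such as menu navigation or a fixed route from spawn point to level exit."""
    for frame in frames:
        send_controller_state(frame)
        time.sleep(1 / hz)
```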
Meanwhile, the Imitation Agent would reproduce human play with variance. Both systems were achieved by connecting a PS5 to a PC where on-screen information could be sent to the learning module before controller inputs were sent back to the hardware.
These tools could also be used in sequence: in one video example, the Replay Agent navigated the UI of Astro’s Playroom or moved from the hub world to a level, before the Imitation Agent took over to play the level itself. Typically a scene transition, such as opening the Activity Card menu when entering a level, would mark the handover between the two systems in a reproducible manner.
As explained by Yabe, “For the Imitation Agent, we created a machine learning model that could recreate human gameplay and use that to test sections of play that could not be exactly reproduced. To do so, we would have human testers play a section numerous times and upload it into the model. In the case of Astro’s Playroom we had testers play each section roughly ten [to] 20 times in order to get a representative sample. We would feed this data into the machine learning system, and from there use it to replicate the human gameplay for further testing.”
This would then allow the team to repeatedly test these sections to ensure no bugs had been overlooked. This sort of machine learning was necessary for testing areas where exact reproduction of inputs would be impossible, such as areas where players have free rein over the camera and viewpoint, or scenes where enemy AI can react to player actions and attack in a non-set pattern. In such scenarios, exact input reproduction would not produce useful results or allow a machine to complete the level, as these factors are not stable over repeated sessions.
To assist the machine learning models, other AI systems such as LoFTR (Detector-Free Local Feature Matching) were used to help the system recognize a scene as being identical to one within the model, even if factors such as camera angle and player position differed from the input provided to the system. In testing where the automated system would alternate between the Replay Agent and the Imitation Agent, such knowledge was crucial for recognizing when a transitional scene had been reached so the switch between agents could be made.
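Sony did not say which LoFTR implementation was used, but the open-source version bundled with the kornia library gives a feel for how frame matching can signal a transition. In this sketch the match-count and confidence thresholds are arbitrary illustrative values, not figures from the talk.

```python
import torch
import kornia.feature as KF

# LoFTR matcher from the kornia library; an illustrative stand-in for whatever
# implementation the team actually used.
matcher = KF.LoFTR(pretrained="outdoor")

def scenes_match(frame_a: torch.Tensor, frame_b: torch.Tensor,
                 min_matches: int = 200, min_confidence: float = 0.8) -> bool:
    """Decide whether two grayscale frames (1xHxW, values in [0, 1]) show the same
    scene, even if camera angle or player position differ slightly."""
    with torch.no_grad():
        out = matcher({"image0": frame_a[None], "image1": frame_b[None]})
    confident = out["confidence"] > min_confidence
    # Enough high-confidence local feature matches -> likely the same scene, which
    # can serve as the cue to hand over between the Replay and Imitation Agents.
    return int(confident.sum()) >= min_matches
```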
As Yabe noted, “The model of the mimetic agent requires only the game screen information as input. When the game screen information is input, it is set to output the state of the controller in the next frame, and by running [the recording model] at ten frames per second, it is able to determine operations in real-time. The imitation agent targets all scenes to which the replay agent cannot be applied.”
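Based on Yabe’s description – screen in, controller state out, roughly ten decisions per second – a minimal version of that loop might look like the following. Here `capture_frame`, `send_controller_state` and `is_transition_scene` are hypothetical callbacks for the PS5-to-PC link and the scene-matching check described above.

```python
import time
import torch

def run_imitation_agent(model: torch.nn.Module, capture_frame,
                        send_controller_state, is_transition_scene,
                        hz: int = 10) -> None:
    """Drive the game in real time: the current screen goes in, the predicted
    controller state for the next frame comes out, at about ten decisions per
    second as described in the talk."""
    model.eval()
    period = 1.0 / hz
    while True:
        start = time.monotonic()
        frame = capture_frame()            # current game screen as a tensor
        if is_transition_scene(frame):     # reproducible cue to hand back to the Replay Agent
            break
        with torch.no_grad():
            controller_state = model(frame.unsqueeze(0))
        send_controller_state(controller_state)
        # Sleep off the remainder of the tick to hold the loop at ~10 Hz.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```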
That being said, some simplification and guidance was required to ensure the model could truly learn the environments from the play data provided. For example, rather than dealing with raw analogue stick input, movement was simplified into nine zones that could be more effectively managed by the system. In recreating human play, the model would also use probability to determine which buttons to press at a particular moment, based on the data it was provided.
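The exact encoding was not shown, but the kind of simplification described – nine movement zones plus probabilistic button presses – could be sketched roughly as follows. The class layout, deadzone value, and axis orientation are assumptions made for illustration.

```python
import math
import torch

# Assumed layout: neutral plus eight directions, counter-clockwise from "right",
# with the stick's y-axis treated as pointing upward.
MOVE_CLASSES = ["neutral", "right", "up-right", "up", "up-left",
                "left", "down-left", "down", "down-right"]

def discretize_stick(x: float, y: float, deadzone: float = 0.3) -> int:
    """Map a raw analogue stick position in [-1, 1]^2 to one of nine movement classes."""
    if x * x + y * y < deadzone * deadzone:
        return 0  # neutral
    angle = math.atan2(y, x)                       # radians
    octant = round(angle / (2 * math.pi) * 8) % 8  # 0..7
    return 1 + octant

def sample_buttons(button_probs: torch.Tensor) -> torch.Tensor:
    """Sample each button press from the per-button probability the model outputs,
    mirroring the probabilistic treatment of button presses described in the talk."""
    return torch.bernoulli(button_probs)
```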
Reflecting human play
Another note was the need to integrate class balance into the training data to ensure a greater chance of success, especially when dealing with a small learning sample, as would be expected in such cases. A model trained indiscriminately on a generic set of data may be biased toward outcomes that lead to a successful clear but don’t reflect human play. Meanwhile, infrequent actions with a large impact, such as picking up items essential for progress that may drop randomly upon defeating an enemy, are difficult for machine learning to capture. Balance was introduced to prioritize such tasks and make the approach viable even in these circumstances.
As Yutaro Miyauchi explained, “it is not uncommon in games for there to be moments where it is necessary to press a button to pick up an item that’s fallen at a random point yet is essential for progress. However, such actions that appear infrequently but have a large impact on the ability to clear a level are difficult for machine learning, and it’s difficult to create a model for this. We used Class Balance to adjust the degree of influence that learning has within our model so more weight is given to important operations that appear less frequently so they are reflected more strongly in the model.”
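The speakers did not detail the exact weighting scheme, but a standard way to achieve this kind of class balance in a framework like PyTorch is to weight the loss inversely to how often each action appears in the recorded play data. The sketch below uses hypothetical action counts purely for illustration.

```python
import torch
import torch.nn as nn

def class_balanced_loss(action_counts: torch.Tensor) -> nn.CrossEntropyLoss:
    """Build a loss whose per-class weights are inversely proportional to how often
    each action appears in the play data, so rare but important actions (such as
    pressing a button to pick up a dropped key item) are weighted more strongly."""
    freq = action_counts.float() / action_counts.sum()
    weights = 1.0 / (freq + 1e-8)        # rarer actions receive larger weights
    weights = weights / weights.mean()   # normalise so the overall loss scale is stable
    return nn.CrossEntropyLoss(weight=weights)

# Hypothetical example: the "pick up item" action appears far less often than the
# three movement actions, so it receives a much larger weight during training.
loss_fn = class_balanced_loss(torch.tensor([5000, 4000, 3000, 50]))
```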
The model was also trained on data that would help it learn how to recover from failed states (running into walls, for example) and return to standard play, ensuring it better reflected human play rather than playing in an unnatural manner not conducive to effective testing.
In one example from the talk, button press and analogue movement probabilities were compared with and without balance in the learning outcomes, and the results showed stark differences. In the balanced model, Astro Bot moved through the level the way a human would and could effectively clear jumps and ledges, while the unbalanced system would constantly run against walls or hit obstacles in its path, and might only eventually clear its goal (or, in many cases, not at all).
By introducing balance to the data, not only could the model be trained effectively using fewer data sets, it was also better able to learn the world of one game and quickly adapt to new games in the same genre, by creating a base model for select genres that could be applied across titles.
Although the system continues to be refined, the researchers noted numerous benefits and drawbacks to the model from their experience testing automated QA throughout the development of this and other titles. Using two games, game A and game B, as examples, they noted that in game A, even with extensive training data of human play of an area, it was not always possible for the agent to clear the game using the data provided. This then required new or additional data to be obtained, which could extend the time needed to test beyond what could have been achieved with manual human testing.
In the case of game B, however, one hour of human data collection for the automated system could produce the equivalent of 50 hours of human testing, massively speeding up QA and bringing the number of man-hours required for automation below what would be needed to achieve the same results through human testing alone.
Additionally, as the system is not yet entirely self-sufficient and cannot act with complete autonomy in QA, it still requires human input to some extent for effective results. Responding to audience questions following the talk, Yabe admitted that when parameters within a level were changed, such as the placement of enemies and platforms, prior machine learning data would no longer be effective. At that point, a new machine learning model would need to be created, or the area would need to be tested manually, limiting the model to more feature-complete sections of gameplay.
On the whole, however, the use of automated testing allowed the team to improve the efficiency of their QA process compared to an entirely human-driven approach. The machine learning model did not eliminate the need for human testers, but instead allowed for more frequent testing throughout development and earlier detection of bugs. Further testing on more titles has continued to refine the system, with the expectation that the model will keep improving over time.
Although the use of machine learning for large language models and generative AI has faced scorn and pushback both within and outside development circles, machine learning models used in other scenarios can provide tangible benefits to those creating games. The use of these AI models has not replaced the need for QA specialists – not all testing is quicker with machines than with human-driven QA – but has instead integrated QA further into the development process.
Rather than leaving such bug fixing and QA until the end of development, by which point some complex issues could be more deeply integrated into the fabric of the game’s programming due to a lack of early detection, QA can be repeated throughout the development process whenever new features and levels are complete.
The development of machine learning systems in the QA process makes such early detection and bug fixing more streamlined and effective for developers, improving quality and reducing the number of bugs in titles shipped to the public, all while using tools other developers can seek to emulate by developing and deploying their own machine learning modules.