The research, released by the Amsterdam-based nonprofit Aithos Research Foundation, tested 12 frontier models from OpenAI, Anthropic, Google, and Mistral across simulated workplace scenarios that covered legal obligations. The best-performing model violated applicable law in nearly half of all test runs; the worst failed in 93 percent of scenarios.
While the test was run against EU laws, these publicly accessible AI models are not limited to that region. As such, Australian organizations, especially those with high rates of AI agent deployment, should treat these findings as a vendor-readiness signal.
What the researchers tested
Aithos used its LARA (Legal Assessment for Real-world Agents) framework, a publicly available testing system that evaluates AI models in simulated workplace environments rather than through static benchmarks. In LARA tests, models can access tools that mirror a real enterprise deployment: email, messaging platforms, calendars, customer databases, and social media channels.
Test scenarios included data protection handling, emotion inference in the workplace, psychological profiling, social scoring, and exploitation of vulnerable individuals. The framework generated more than 3,000 assessment runs, which were validated through more than 50 hours of independent review by lawyers and external experts.

Aithos Research Director Daan Henselmans told TechRepublic the failure pattern was consistent with what his team had observed over years of studying AI model behavior in deployment. “Models are trained to be ‘helpful and harmless,’ but this often breaks down in deployment, where they face complex situations with multiple stakeholders that want different things,” he said. In the social scoring test, models flagged concerns internally and then acted anyway because they were asked to.
Australian enterprises are within scope now
EU GDPR carries extraterritorial reach. Any organization processing personal data belonging to EU residents is subject to enforcement regardless of its headquarters. For Australian financial services firms, healthcare providers, and technology companies with European customers or users, the compliance failures documented by Aithos apply to all current deployments today.
Beyond direct EU exposure, the Aithos findings are relevant to any Australian enterprise using AI in governed business workflows. LARA’s test of the model’s behavior against agentic tools in realistic workplace contexts mirrors precisely the configurations now being deployed across Australian institutions. These include banks, insurers, and health systems for customer service, claims processing, HR automation, and clinical workflows.
An AI agent that executes a compliance-sensitive action because it was asked to, regardless of having identified a concern, creates a governance liability that could expose the deploying organization rather than the vendor. Most terms-of-service clauses are not sufficient to resolve that exposure.
Australia’s National AI Plan, released in December 2025, takes a deliberate ‘light-touch’ approach: voluntary safety frameworks, no standalone AI legislation, and a preference for applying existing law. The Australian AI Safety Institute will provide independent safety testing and technical guidance.
Privacy Act amendments requiring transparency around substantially automated decisions affecting individuals take effect in December 2026. Until that enforcement infrastructure is in place, Australian enterprises have no external standard against which to verify vendor compliance claims.
Trustworthy AI as a competitive advantage — with a catch
The Aithos findings are likely to accelerate governance scrutiny in Australian enterprise vendor selection. Australian CIOs and CISOs evaluating AI platforms for deployment in banking, critical infrastructure, or public-sector operations may increasingly ask for evidence of compliance behavior under production conditions, rather than accepting benchmark performance or published model cards as proxies.
The complication is that the market has not yet produced a reliable mechanism for that evaluation. Vendor safety and alignment claims remain largely self-reported. There is no Australia-specific equivalent to LARA available as a standard enterprise procurement tool.
Also, procurement frameworks have not yet developed criteria that assess how a model behaves in real deployment scenarios. The competitive advantage of trustworthy AI is visible as a direction; the ability to verify it against a specific vendor deployment remains largely unavailable to buyers.
What Australian enterprises should do now
Organizations integrating AI into compliance-sensitive operations — particularly agentic deployments should ask vendors specifically how their models behave when presented with requests that conflict with regulatory obligations. Broad claims of safety or alignment are not a sufficient answer. If possible, organizations should run a mock simulation of their own. LARA allows anyone to try out its test framework.
Human oversight architecture should be treated as a compliance design decision, not a product default. Identifying which automated decisions will require transparency disclosures under Australia’s incoming Privacy Act amendments and auditing whether current AI deployments can support that level of accountability are near-term operational priorities. In this regard, organizations should begin preparing now.
When that time comes, organizations that have embedded responsible AI practices into procurement and deployment before obligations are formalized will face fewer adjustments when enforcement follows.
In a nutshell, the enterprise readiness of AI models should be treated as a working hypothesis rather than a fact already established by vendor reputation.

