30 Days of AI in Testing – Days 21-25

Currently doing the 30 Days of AI in Testing offered by the excellent Ministry of Testing.

https://www.ministryoftesting.com/events/30-days-of-ai-in-testing

This blog is a record of Days 21 to 25, rather than having them just on the Club. It’s quite a lot of writing and I like to have my writing in one place where possible.

Day 21

Create an AI in Testing Manifesto

This is quite the big task, but it’s worthwhile thinking big, with big ideas! I will take my inspiration from the Agile Manifesto.

clears throat

To empower testers to provide even greater value to their teams and organisations, we will partner with AI.

  • Assistance from AI instead of replacement
  • Seeing for ourselves rather than dismissing AI
  • Transparency of models through testing instead of opacity
  • Testing for fairness and bias instead of accepting the status quo
Principles
  • Data used to train models should be from sources where the original owner has given permission.
  • Where AI can help us with wasteful or inefficient practices we will approach with an open mind.
  • Utilising AI in testing comes with a responsibility of continuing to hone our own testing skills.
  • AI in testing is best suited to structured, deterministic work; humans are superior explorers.
  • We will be vigilant wherever we see AI being used for nefarious purposes and challenge where safe to do so.
  • If AI is used to threaten the role or skills of a team member, we will show solidarity.
  • We will not accept the first answer given by an AI; the prompt can always be improved.
  • We will endeavour to use the cleanest language possible when interacting with AI, to get the best outcomes.
  • For accessibility, we will go beyond using AI and get real people with differing needs involved.
  • Where AI is used to replace an interaction with a team member, we will challenge this usage of AI.

Phew! That was pretty deep stuff. There have been so many great answers to go through as well.

Day 22

I think I like the idea of approaching this like an ‘enabling team’ in ‘facilitation’ mode in Team Topologies land. How would we spread the word and practices of AI-assisted testing to our development teams?

Broader Skills

The following skills seem important to me:

  • Testing and what makes good testing for a team in their context, rather than applying the same practices to every team; again, an enabling team helps by discovering that context.
  • Ethical considerations – with AI in testing, the question becomes much less whether we can do something and much more whether we should. The potential for both good and harm is very large here; an enabling team for AI in testing would help frame moral and ethical considerations. The test here is most likely around drift: it seems to me that with AI it would be easy to gradually drift into dubious territory with what a team may build.
  • Observability – the testing role may become more that of a watcher and evaluator, looking at prompts and answers from Generative AI and evaluating them critically. You need to be able to see first though, so tools like Langchain, specialist observability tooling like Honeycomb, and standard logging tooling (ELK stack etc.) would be useful (a sketch of this kind of prompt/response logging follows this list).
  • Security and privacy – this would be both dealing with technologists’ natural skepticism of AI and the potential leakiness of all models: how they can be deployed, what models to use for what, which need to be internally developed and deployed, and which can be used in a more public context.
  • Technology research – given that tools and technology change at a ridiculous rate, a significant portion of your AI in testing effort would be research: both tools and technology specific to testing, and new capabilities of generative AIs, architectures, patterns and the rest.
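
As a concrete illustration of that observability point, here is a minimal sketch of structured prompt/response logging in Dart (matching the code later in this post). The PromptLogEntry class and its fields are my own invention, not any particular tool’s API:

  import 'dart:convert';

  /// A hypothetical record of a single generative AI interaction, kept so
  /// testers can review prompts and answers after the fact.
  class PromptLogEntry {
    final DateTime timestamp;
    final String model;
    final String prompt;
    final String response;

    PromptLogEntry(this.timestamp, this.model, this.prompt, this.response);

    /// Emit as structured JSON, ready for an ELK stack, Honeycomb or similar.
    String toJson() => jsonEncode({
          'timestamp': timestamp.toIso8601String(),
          'model': model,
          'prompt': prompt,
          'response': response,
        });
  }

  void main() {
    final entry = PromptLogEntry(
      DateTime.now(),
      'example-model-v1', // made-up model name
      'Explain how the decode method works',
      'The decode method splits the token into three parts...',
    );
    // In a real system this would go to a logging pipeline, not stdout.
    print(entry.toJson());
  }

Once every interaction is captured like this, a tester can evaluate prompts and answers after the fact rather than only in the moment.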

An enabling team helps development teams find new patterns, tools and technologies, but doesn’t dictate them. Therefore a high level of collaboration and advocacy is required.

Envision Key Roles and Responsibilities

I think early on I would just have a small team and set a focus, rather than getting too heavy.

  • Engaging with teams to find their aspirations and problems that AI in testing could assist with. On top of this:
    • First priority – research, and communicating that research to teams.
    • Second – Ethical considerations – help teams understand the implications of what they are building and of using AI in testing.
    • Third – Observability
    • Fourth – Security and privacy.

I might then switch these around depending on how it’s working out, by asking the teams engaged with so far on a regular basis.

Ideal Team Setup

Probably a small team of 4 or so drawn from a few disciplines. It’s hard to say exactly what those would be, but I guess I would try:

  • A developer who really values testing
  • An experienced DevOps person who is used to supporting testing
  • An exploratory tester with an interest in both what and how something is built.
  • A UX professional with an interest in user research to engage with teams and judge the impact of AI, both internally and externally.

Day 23

Option 2

Research AI visual testing

This was quite interesting and made some bold claims:

https://www.browserstack.com/guide/how-ai-in-visual-testing-is-evolving

It talks about Computer Vision vs Visual AI. Computer vision allows a machine to derive information from visual stimuli; Visual AI goes beyond that by analysing, learning and apparently ‘forming thoughts.’ Both have a role to play in image classification, such as optical character recognition and object detection, and then in object tracking, where an object can be recognised and then followed in different contexts. For visual testing, Visual AI locates elements ‘as a human would’, even if the underlying element was renamed. The end of the article is surprisingly forthright that visual testing creates tons of false positives, even when augmented with AI capabilities.

Critique the AI approach

The article lists three scenarios where AI still struggles to recognise valid positives and negatives:

  1. Anti-aliasing of text, with pixel-by-pixel comparison being too sensitive a medium.
  2. Some parts of the app may change: click-through adverts or dynamic campaigns, for example.
  3. Different browser engines treat images and fonts differently, which is not indicative of a problem.

I understand why it is desirable to introduce a learning system into this process, as going through endless screenshots comparing them by eye is extremely tedious. Visual testing without some kind of AI augmentation seems very limited to me.

Assess if AI visual testing is for you

As I’m working on a Flutter app, I think using golden tests for visual comparison (with or without AI) is a better option; a minimal sketch follows. They can be done at individual widget level, which makes the comparison easier for people and also more programmatically stable. They don’t however visually test everything together in a page. That still needs to be done by a human. Perhaps some limited comparisons might be OK to add, but I would be extremely wary of committing wholesale to visual testing.
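
For illustration, a minimal golden test sketch. The PriceTag widget is invented for the example, as is the golden file path; any small widget works the same way:

  import 'package:flutter/material.dart';
  import 'package:flutter_test/flutter_test.dart';

  // A hypothetical widget under test.
  class PriceTag extends StatelessWidget {
    final String label;
    const PriceTag({super.key, required this.label});

    @override
    Widget build(BuildContext context) =>
        Chip(label: Text(label), backgroundColor: Colors.amber);
  }

  void main() {
    testWidgets('PriceTag matches its golden file', (tester) async {
      await tester.pumpWidget(
        MaterialApp(home: Scaffold(body: Center(child: PriceTag(label: '9.99')))),
      );
      // Compares the rendered widget against a stored reference image;
      // regenerate the reference with: flutter test --update-goldens
      await expectLater(
        find.byType(PriceTag),
        matchesGoldenFile('goldens/price_tag.png'),
      );
    });
  }

Because the comparison is scoped to a single widget, a failing diff stays small enough to reason about, which is exactly the stability property mentioned above.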

Day 24

I have gone with Option 2, although both have some interest to me. One of the dependencies I currently work on uses JWTs to pass information around securely, a not uncommon task. I will use the decode method, in Dart.

Added the following prompt to Claude:

  • Instruction – Explain how the following dart code works:
  /// Decode a string JWT token into a `Map<String, dynamic>`
  /// containing the decoded JSON payload.
  ///
  /// Note: header and signature are not returned by this method.
  ///
  /// Throws [FormatException] if parameter is not a valid JWT token.
  static Map<String, dynamic> decode(String token) {
    final splitToken = token.split("."); // Split the token by '.'
    if (splitToken.length != 3) {
      throw FormatException('Invalid token');
    }
    try {
      final payloadBase64 = splitToken[1]; // Payload is always the index 1
      // Base64 should be multiple of 4. Normalize the payload before decode it
      final normalizedPayload = base64.normalize(payloadBase64);
      // Decode payload, the result is a String
      final payloadString = utf8.decode(base64.decode(normalizedPayload));
      // Parse the String to a Map<String, dynamic>
      final decodedPayload = jsonDecode(payloadString);

      // Return the decoded payload
      return decodedPayload;
    } catch (error) {
      throw FormatException('Invalid payload');
    }
  }
  • Context – Our application receives a JWT when a customer authenticates and then the application decodes and uses the content of the JWT throughout the app to access paid for features.
  • Role – please assume the role of a software developer explaining to a product owner
  • Output Indicator – please provide a narrative of how the code works to be disseminated to a non technical audience.

Claude responded with:

Here's a narrative explaining how the provided code works, tailored for a non-technical audience:

When a customer authenticates with our application, they receive a special code called a JSON Web Token or JWT. This JWT contains important information about the customer, such as their user ID, email address, and which features they have access to based on their subscription level.

However, this JWT is encoded in a way that our application can't directly understand. It's like receiving a message written in a secret code or language. So, we need a way to decode or translate this JWT into something our application can read and understand.

The code you provided is responsible for decoding the JWT. It takes the encoded JWT as input and outputs a regular map or dictionary containing the decoded information in a format our application can easily work with.

Here's a step-by-step breakdown of what the code does:

1. First, it splits the JWT into three parts separated by periods (.). A valid JWT always has this structure.

2. If the JWT doesn't have exactly three parts, it means the token is invalid, and the code throws an error.

3. If the JWT is valid, it takes the second part, which is the actual encoded payload or message containing the customer's information.

4. This encoded payload is then normalized and decoded from a special format called Base64 into a regular string of text.

5. The decoded string is then parsed into a map or dictionary, where each piece of information (like the user ID or email) is assigned a key.

6. Finally, this map containing the decoded customer information is returned and can be used throughout our application to control access to features based on the customer's subscription level.

If any part of this process fails, such as if the encoded payload is corrupted or invalid, the code throws an error to let our application know something went wrong with decoding the JWT.

In summary, this code acts as a translator, taking the encoded JWT our application receives during customer authentication and converting it into a readable format our application can understand and use to manage customer access and functionality.

In terms of evaluation:

  • The step-by-step breakdown is mostly not communicable to a non-technical audience; it doesn’t actually explain what is happening.
  • The other parts of the response are useful though, and draw comparisons to easier-to-understand concepts (translator, secret language).
  • It doesn’t really address why a JWT is used at all, although I didn’t put that in the prompt.

When I asked Claude for any improvements, this was addressed:

  • You could have provided a sample encoded JWT token to make the example more concrete (sketched below)
  • More background on what JWT is and why it’s used could be helpful for a totally non-technical audience
  • Clarification on whether I should just explain the decode() function or the entire JWT flow
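
The first of those is easy to act on. Here is a minimal sketch that hand-builds a toy token and runs it through the decode logic from the prompt (condensed to a top-level function so the example is self-contained; the claims are made up):

  import 'dart:convert';

  // The decode() method from the prompt, condensed to a top-level function.
  Map<String, dynamic> decode(String token) {
    final splitToken = token.split('.');
    if (splitToken.length != 3) throw FormatException('Invalid token');
    final payload =
        utf8.decode(base64.decode(base64.normalize(splitToken[1])));
    return jsonDecode(payload);
  }

  void main() {
    // Hand-build a toy JWT: header.payload.signature, each part base64url
    // encoded. The signature is fake, which is fine here because decode()
    // never verifies it; it only reads the payload.
    final header = base64Url.encode(utf8.encode('{"alg":"HS256","typ":"JWT"}'));
    final payload =
        base64Url.encode(utf8.encode('{"sub":"customer-42","plan":"premium"}'));
    final token = '$header.$payload.fake-signature';

    print(decode(token)); // {sub: customer-42, plan: premium}
  }

Including a concrete token like this in the prompt would have given Claude something real to narrate against.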

Overall, asking a generative AI to tell you what code does, with only light context about what you want to do with it, seems a little trite, so we should endeavour to provide more to get a better answer.

Day 25

Find an article about AI augmented security testing

I chose this by GitHub:

https://github.blog/2023-11-08-ai-powered-appsec/

GitHub obviously hosts a lot of code and would have access to a lot of training data for a security AI model, so I thought they might have some interesting capabilities and insights.

It starts well, saying that we shouldn’t try to inspect security into our applications, but should get that feedback as soon as possible. It mentions four features they have worked on:

  • CodeQL Auto suggestions and fixes – this looks very interesting, recommending security fixes such as rate limiting as shown in the little video on the page. It gives a fairly basic example – an HTTP GET handler that makes a direct SQL query – which is definitely a security problem. One would hope that organisations would be able to hit that minimum, or maybe experience teaches us otherwise!
  • Passwords in source control detection – raises a good point: as secrets are often unstructured and hard to detect, an LLM could learn to detect them effectively. I suppose it’s easier to detect the fields which are literally called passwords; more subtle secrets, API keys for example, may just look like GUIDs used in tests.
  • Regular Expression Custom Patterns for Secrets – this is nice: you can add patterns to detect within your codebase. You can generate them with AI via a simple form for descriptions and examples, which is a decent way to build a good prompt, with a selection of results to use. I would be very interested to test this, as the road to hell is paved with regular expressions (see the sketch after this list).
  • Dashboard – Showing risk, remediation and prevention across multiple projects.
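
On that regular expressions point, here is a minimal sketch of the kind of custom pattern such a scanner might run. The ‘acme_’ prefix and 32-hex-character body are invented conventions for illustration, not GitHub’s actual patterns:

  void main() {
    // A made-up convention: keys look like acme_ followed by 32 hex characters.
    final apiKeyPattern = RegExp(r'acme_[0-9a-f]{32}');

    const sourceSnippet = '''
  var config = {
    "endpoint": "https://api.example.com",
    "apiKey": "acme_9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c"
  };
  ''';

    for (final match in apiKeyPattern.allMatches(sourceSnippet)) {
      print('Possible leaked secret: ${match.group(0)}');
    }
  }

The interesting thing to test is what the AI generates from fuzzier descriptions, where a slightly wrong pattern silently misses secrets rather than failing loudly.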

Like many scanners, even AI-augmented ones can give the illusion of making a difference in a non-functional area, but can be deceptive. Recommending rate limiting in the auto suggestions is a distant second to having a sensible architecture! I fear that teams might be tempted by scanners detecting problems they don’t understand, followed by auto fixes they don’t understand that might be wrong. The password work is good though, and a nice baseline with fairly universal understanding. A starter dashboard across multiple projects is also a nice touch, although it should be an invitation to test better, rather than a goal to hit.