Advanced Architecture for AI Application (AKA AAAA!)

This post covers infrastructure improvement ideas for an existing AI application's architecture for better delivery, performance, and cost-reduction.

Surprise! This is a bonus blog post for the AI for Web Devs series I recently wrapped up. If you haven’t read that series yet, I’d encourage you to check it out.

This post will look at the existing project architecture and ways we can improve it for both application developers and the end user.

I’ll be discussing some general concepts and using specific Akamai products in my examples.

Basic Application Architecture

The existing application is pretty basic. A user submits two opponents, then the application streams back an AI-generated response of who would win in a fight.

The architecture is also simple:

  1. The client sends a request to a server.
  2. The server constructs a prompt and forwards the prompt to OpenAI.
  3. OpenAI returns a streaming response to the server.
  4. The server makes any necessary adjustments and forwards the streaming response to the client.
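
Here’s a minimal sketch of what steps two through four could look like on the server. The `ModelStream` type, the prompt wording, and the function names are hypothetical stand-ins chosen for illustration; in the real application, `streamFromModel` would wrap a streaming request to OpenAI.

```typescript
// Hypothetical stand-in for a streaming call to an AI provider.
// In a real app this would wrap a streaming request to OpenAI.
type ModelStream = (prompt: string) => AsyncIterable<string>;

// Step 2: build the prompt from the two opponents.
// The exact wording is an assumption for illustration.
function buildFightPrompt(opponent1: string, opponent2: string): string {
  return `Who would win in a fight between ${opponent1} and ${opponent2}?`;
}

// Steps 3 and 4: read chunks from the model as they arrive and pass
// each one along to the client without waiting for the full response.
async function* streamFightResponse(
  opponent1: string,
  opponent2: string,
  streamFromModel: ModelStream,
): AsyncGenerator<string> {
  const prompt = buildFightPrompt(opponent1, opponent2);
  for await (const chunk of streamFromModel(prompt)) {
    // Any "necessary adjustments" to a chunk would happen here.
    yield chunk;
  }
}
```

Passing the model call in as a parameter is a design choice for the sketch: it keeps the streaming logic testable with a fake model and independent of any one provider’s SDK.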

I used Akamai’s cloud compute services (formerly Linode) but this would be the same for any hosting service, really.

Architecture diagram showing a client connecting to a server inside the Cloud, which forwards the request to OpenAI, then returns to the server and back to the client.
🤵 looks like a server at a fancy restaurant, and 👁️‍🗨️ is “a eye”, or AI. lolz

Technically this works fine, but there are a couple of problems, particularly when users make duplicate requests. It could be faster and more cost-effective to store responses on our server and only go to OpenAI for unique requests.

This assumes we don’t need every single request to be non-deterministic (the same input produces a different output). Let’s assume it’s OK for the same input to produce the same output. After all, a prediction for who would win in a fight wouldn’t likely change.

Add Database Architecture

If we want to store responses from OpenAI, a practical place to put them is in some sort of database that allows for quick and easy lookup using the two opponents. This way, when a request is made, we can check the database first:

  1. The client sends a request to a server.
  2. The server checks for an existing entry in the database that matches the user’s input.
  3. If a previous record exists, the server responds with that data, and the request is complete. Skip the following steps.
  4. If not, the server continues from step two in the previous flow (constructing the prompt and forwarding it to OpenAI).
  5. Before closing the response, the server stores the OpenAI results in the database.
Architecture diagram showing a client connecting to a server inside the Cloud, which checks for data in a database, then optionally forwards the request to OpenAI to get the results, then returns the data back to the client.
Dotted lines represent optional requests, and the 💽 kind of looks like a hard disk.

With this setup, any duplicate requests will be handled by the database. By making some of the OpenAI requests optional, we can potentially reduce the amount of latency users experience, plus save money by reducing the number of API requests.
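
As a rough sketch, the check-the-database-first flow could look something like this. The `ResponseStore` interface, the key format, and the `askOpenAI` parameter are all assumptions made for the example; any database that supports lookup by key would fit behind the same shape.

```typescript
// Hypothetical storage interface. A real app might back this with a
// SQL table, Redis, or any store that supports lookup by key.
interface ResponseStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// Build a lookup key from the two opponents. Trimming and lowercasing
// are assumptions, so "Batman" and " batman " share one record.
// Order is preserved: "a vs b" and "b vs a" get separate entries.
function fightKey(opponent1: string, opponent2: string): string {
  const clean = (name: string) => name.trim().toLowerCase();
  return `${clean(opponent1)}|${clean(opponent2)}`;
}

async function getFightResult(
  opponent1: string,
  opponent2: string,
  store: ResponseStore,
  askOpenAI: (opponent1: string, opponent2: string) => Promise<string>,
): Promise<{ text: string; fromCache: boolean }> {
  const key = fightKey(opponent1, opponent2);

  // Steps 2 and 3: return the stored answer if we have one.
  const cached = await store.get(key);
  if (cached !== undefined) {
    return { text: cached, fromCache: true };
  }

  // Step 4: nothing stored, so pay for an API call.
  const text = await askOpenAI(opponent1, opponent2);

  // Step 5: save the result before finishing the response.
  await store.set(key, text);
  return { text, fromCache: false };
}
```

How aggressively to normalize the key is a judgment call. Looser normalization means more cache hits, but also more chances that two inputs a user considers different end up sharing one answer.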

This is a good start, especially if the server and the database exist in the same region. It would make for much quicker response times than going to OpenAI’s servers.

However, as our application becomes more popular, we may start getting users from all over the world. Faster database lookups are great, but what happens if the bottleneck is the network latency from the time requests spend in flight?

We can address that concern by moving things closer to the user.

Bring in Edge Compute

If you’re not already familiar with the term “edge”, this part might be confusing, but I’ll try to explain it simply. Edge refers to content being as close to the user as possible. For some people, that could mean IoT devices or cellphone towers, but in the case of the web, the canonical example is a Content Delivery Network (CDN).

I’ll spare you the details, but a CDN is a network of globally distributed computers that can respond to user requests from the nearest node in the network (something I’ve written about in the past). While traditionally they were designed for static assets, in recent years, they started supporting edge compute (also something I’ve written about in the past).

With edge compute, we can move a lot of our backend logic super close to the user, and it doesn’t stop at compute. Most edge compute providers also offer some sort of eventually-consistent key-value store in the same edge nodes.

How could that impact our application?

  1. The client sends a request to our backend.
  2. The edge compute network routes the request to the nearest edge node.
  3. The edge node checks for an existing entry in the key-value store that matches the user’s input.
  4. If a previous record exists, the edge node responds with that data and the request is complete. Skip the following steps.
  5. If not, the edge node forwards the request to the origin server, which passes it along to OpenAI and yadda yadda yadda.
  6. Before closing the response, the server stores the OpenAI results in the edge key-value store.
The edge node is the blue box, represented by 🔪 because it has an edge. EdgeWorkers is Akamai’s edge compute product, represented by 🧑‍🏭, and EdgeKV is Akamai’s key-value store, represented by 🔑🤑🏪. The edge box is closer to the client than the origin server in the cloud to represent physical distance.

The origin server may not be strictly necessary here, but I think it’s more likely to be there. For the sake of data, compute, and logic flow, this is mostly the same as the previous architecture. The main difference is that the previously stored results now exist super close to users and can be returned almost immediately.
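
To make the edge flow a little more concrete, here’s a sketch of the logic an edge function might run. The `EdgeStoreReader` interface and the `forwardToOrigin` parameter are hypothetical; this is not the EdgeKV API, so check your provider’s documentation for the real method names.

```typescript
// Hypothetical read-only view of an edge key-value store.
// This is NOT the EdgeKV API; real method names will differ.
interface EdgeStoreReader {
  getText(key: string): Promise<string | null>;
}

interface EdgeResult {
  body: string;
  servedFrom: "edge" | "origin";
}

async function handleAtEdge(
  key: string,
  kv: EdgeStoreReader,
  forwardToOrigin: (key: string) => Promise<string>,
): Promise<EdgeResult> {
  // Step 3: look in the key-value store on this edge node.
  // A failed lookup is treated as a miss so the request still succeeds.
  let stored: string | null = null;
  try {
    stored = await kv.getText(key);
  } catch {
    stored = null;
  }

  // Step 4: a hit never leaves the edge.
  if (stored !== null) {
    return { body: stored, servedFrom: "edge" };
  }

  // Step 5: a miss goes to the origin, which calls OpenAI and
  // (step 6) writes the result to the key-value store itself.
  const body = await forwardToOrigin(key);
  return { body, servedFrom: "origin" };
}
```

Because these stores are eventually consistent, a value written from one location may take a moment to show up at other edge nodes, so a few duplicate requests can still reach the origin shortly after the first one. Treating lookup errors as misses keeps the cache an optimization rather than a point of failure.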

(Note: although the data is being cached at the edge, the response is still dynamically constructed. If you don’t need dynamic responses, it may be simpler to use a CDN in front of the origin server and set the correct HTTP headers to cache the response. There’s a lot of nuance here, and I could say more but…well, I’m tired and don’t really want to. Feel free to reach out if you have any questions.)
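
For the simpler CDN-only approach mentioned in the note, the origin’s main job is to send headers that permit caching. Here’s a small sketch; the function name, the content type, and the one-day lifetime are assumptions for illustration.

```typescript
// Build response headers that allow shared caches (such as a CDN) to
// store a response. The one-day default lifetime is an arbitrary assumption.
function cacheableHeaders(maxAgeSeconds: number = 86400): Record<string, string> {
  return {
    "Content-Type": "text/plain; charset=utf-8",
    // "public" allows shared caches to store the response;
    // "s-maxage" sets how long they may reuse it, in seconds.
    "Cache-Control": `public, s-maxage=${maxAgeSeconds}`,
  };
}
```

CDNs typically key cached responses by URL, so this fits best when the two opponents are part of a GET request’s URL, and it only helps if the CDN is configured to honor the origin’s headers.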

Now we’re cooking! Any duplicate requests will be responded to almost immediately, while also saving us unnecessary API requests.

This sorts out the architecture for the text responses, but we also have AI-generated images.

Cache Those Images

The last thing we’ll consider today is images. When dealing with images, we need to think about delivery and storage. I’m sure that the folks at OpenAI have their own solutions, but some organizations want to own the entire infrastructure for security, compliance, or reliability reasons. Some may even run their own image generation services instead of using OpenAI.

In the current workflow, the user makes a request that ultimately makes its way to OpenAI. OpenAI generates the image but doesn’t return it. Instead, they return a JSON response with the URL for the image, hosted on OpenAI’s infrastructure. With this response, an <img> tag can be added to the page using the URL, which kicks off another request for the actual image.

If we want to host the image on our own infrastructure, we need a place to store it. We could write the images onto the origin server’s disk, but that could quickly use up the disk space, and we’d have to upgrade our servers, which can be costly. Object storage is a much cheaper solution (I’ve also written about this). Instead of using the OpenAI URL for the image, we could upload it to our own object storage instance and use that URL instead.
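
Here’s a sketch of that re-hosting step. The download and upload helpers are injected and entirely hypothetical, so the example doesn’t depend on a specific HTTP client or object storage SDK. The `.png` extension and the key format are also assumptions.

```typescript
// Hypothetical helpers, injected so the sketch does not depend on a
// specific HTTP client or object storage SDK.
interface ImageStorageDeps {
  downloadImage: (url: string) => Promise<Uint8Array>;
  uploadObject: (objectKey: string, data: Uint8Array) => Promise<void>;
  // Public base URL of the bucket, without a trailing slash.
  bucketBaseUrl: string;
}

// Turn the opponents into a URL-safe object key, for example
// "batman-vs-the-hulk.png". Names with no ASCII letters or digits
// would produce an empty slug; a real app might hash the input instead.
function imageObjectKey(opponent1: string, opponent2: string): string {
  const slug = (name: string) =>
    name
      .trim()
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "");
  return `${slug(opponent1)}-vs-${slug(opponent2)}.png`;
}

// Copy the generated image into our own bucket and return our own URL.
async function rehostImage(
  openAiImageUrl: string,
  opponent1: string,
  opponent2: string,
  deps: ImageStorageDeps,
): Promise<string> {
  const bytes = await deps.downloadImage(openAiImageUrl);
  const objectKey = imageObjectKey(opponent1, opponent2);
  await deps.uploadObject(objectKey, bytes);
  return `${deps.bucketBaseUrl}/${objectKey}`;
}
```

Deriving the object key from the opponents means a repeat request maps to the same object, which lines up with the lookup-by-opponents approach used for text.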

That solves the storage question, but object storage buckets are generally deployed to a single region. This echoes the problem we had with storing text in a database. A single region may be far away from users, which could cause a lot of latency.

Having introduced the edge already, it would be pretty trivial to add CDN features for just the static assets (frankly, every site should have a CDN). Once configured, the CDN will pull images from object storage on the initial request and cache them for any future requests from visitors in the same region.

Here’s how our flow for images would look:

  1. The client sends a request to generate an image based on their opponents.
  2. Edge compute checks if the image data for that request already exists. If so, it returns the URL.
  3. The image is added to the page with the URL and the browser requests the image.
  4. If the image has been previously cached in the CDN, the browser loads it almost immediately. This is the end of the flow.
  5. If the image has not been previously cached, the CDN will pull the image from the object storage location, cache a copy of it for future requests, and return the image to the client. This is another end of the flow.
  6. If the image data is not in the edge key-value store, the request to generate the image goes to the server and on to OpenAI, which generates the image and returns the URL information. The server starts a task to save the image in the object storage bucket, stores the image data in the edge key-value store, and returns the image data to edge compute.
  7. With the new image data, the client creates the image, which triggers a new request, and the flow continues from step five above.
Architecture diagram showing a client connecting to an edge node which checks the edge key-value store, then optionally passes the request to a cloud server and on to OpenAI before returning the data to the client. Additionally, if the user makes a request for an image, the request will check a CDN first, and if it doesn't exist, will pull it from Object Storage where it was placed from OpenAI
Content delivery network denoted by delivery truck (🚚) and network signal (📶), and object storage denoted by socks in a box (🧦📦), or objects in storage. This caption is probably not necessary, as I think these are clear, but I’m too proud of my emoji game and require validation. Thank you for indulging me. Carry on.
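
Steps four and five describe a pull-through cache. Here’s a toy model of a single CDN node to show the hit and miss behavior. The class and its names are made up for illustration; a real CDN also handles expiry, purging, and cache keys for you.

```typescript
// Toy model of a single CDN node. Real CDNs also handle expiry,
// purging, and cache keys; this only shows the hit/miss behavior.
class CdnNodeSketch {
  private cache = new Map<string, Uint8Array>();
  private fetchFromObjectStorage: (path: string) => Promise<Uint8Array>;

  constructor(fetchFromObjectStorage: (path: string) => Promise<Uint8Array>) {
    this.fetchFromObjectStorage = fetchFromObjectStorage;
  }

  async getImage(path: string): Promise<{ data: Uint8Array; hit: boolean }> {
    // Step 4: the image is already cached on this node.
    const cached = this.cache.get(path);
    if (cached !== undefined) {
      return { data: cached, hit: true };
    }

    // Step 5: pull from object storage, keep a copy, then respond.
    const data = await this.fetchFromObjectStorage(path);
    this.cache.set(path, data);
    return { data, hit: false };
  }
}
```

Only the first visitor near a given node pays for the trip to the object storage region. Later visitors near that node get the cached copy.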

This last architecture is, admittedly, a little bit more complex, but if your application is going to handle serious traffic, it’s worth considering.


Right on! With all those changes in place, we generate AI text and images for unique requests and serve cached content from the edge for duplicate requests. The result is faster response times and a much better user experience (in addition to fewer API calls).

I kept these architecture diagrams applicable across various databases, edge compute, object storage, and CDN providers on purpose. I like my content to be broadly applicable. But it’s worth mentioning that integrating the edge is about more than just performance. There are a lot of really cool security features you can enable as well.

For example, on Akamai’s network, you get access to features like a web application firewall (WAF), distributed denial of service (DDoS) protection, intelligent bot detection, and more. That’s all beyond the scope of today’s post, though.

So for now, I’ll leave you with a big “thank you” for reading. I hope you learned something. And as always, feel free to reach out any time with comments, questions, or concerns.


Originally published on


  1. Thank you for another fine article. This is a note prompted by my frustration with OpenAI giving obsolete information in its answers:
    I asked ChatGPT “Does a human review these interactions?” The answer shows a major problem with AI: “No, a human does not directly review these interactions. As an AI language model developed by OpenAI, I generate responses based on patterns learned from vast amounts of text data.”
    This shows that when the preponderance of information is based on old, outdated standards, AI’s answers will be based on obsolete standards. For AI, 30 years of text holds more weight than 10 years of a new standard.
    My questions related to what users of Assistive Technology expect when using the Tab key to navigate a form. Does the user expect or appreciate validation when tabbing out of an empty, but required field?
    I hope that Gil will answer this as it is not easy for someone who does not have or use AT to visualize how it will be used.
    Thank you.

    • Hey Fred, thanks for the comment. That’s some good insight on AI. I think what we’ll see is different AIs for different use-cases. Those designed for answering questions will weigh recent data more. Those designed for generating content will weigh a broader scope of work more. Ultimately, there are several different areas that will need their own type of specialization and form-factors. Hope that helps.

      Regarding AT users tabbing through forms, I’m not the BEST person to ask because I represent a visual and physically capable person. I like tab keys to move through forms, but I don’t need it. The best thing you can do is seek out the people that are using AT and ask them. However, I do know that there are things we can do as developers to provide more information about validation as users ENTER form fields rather than waiting for them to EXIT. This can help fill in the fields correctly the first time.
