Generative AI is a simple concept despite its intimidating label. It refers to AI algorithms that produce new output, such as text, photos, video, code, data, and 3D renderings, based on the data they were trained on. Its focus is generating content, in contrast to other forms of AI that serve different purposes, such as analysing data or assisting with self-driving car control.
Why is generative AI a hot topic right now?
Generative AI programs like OpenAI’s ChatGPT and DALL-E are gaining popularity, which has created buzz around the term “generative AI”. These programs can quickly produce a wide range of content, including computer code, essays, emails, social media captions, images, poems, and raps, and that versatility has drawn widespread attention.
Generative AI and language models promise to change how businesses approach design, support, development, and more. This blog discusses why you should run your own large language models (LLMs) instead of relying on new API providers, and describes the evolving tech stack for cost-effective LLM fine-tuning and serving, which combines HuggingFace, PyTorch, and Ray. It also shows how Katonic, a leading MLOps platform, addresses these challenges and enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT.
Why would I want to run my own LLM?
- Cost, especially for fine-tuned inference: For example, OpenAI charges 12c per 1,000 tokens (about 700 words) for a fine-tuned model on Davinci. Many user interactions require multiple backend calls (e.g. one to help with prompt generation, another for post-generation moderation, etc.), so it is quite possible that a single interaction with an end user could cost a few dollars. For many applications, this is cost-prohibitive.
- Latency: Using these LLMs is especially slow. A GPT-3.5 query, for example, can take up to 30 seconds. Combine a few round trips from your data center to theirs and a query can take minutes. Again, this makes many applications impossible. Bringing the processing in-house allows you to optimize the stack for your application, e.g. by using low-resolution models, tightly packing queries onto GPUs, and so on. We have heard from users that optimizing their workflow has often resulted in a latency improvement of 5x or more.
- Data Security & Privacy: For many applications, getting a response from these APIs requires sending them a lot of data (e.g. a few snippets of internal documents that you ask the system to summarize). Many of the API providers reserve the right to use those instances for retraining. Given the sensitivity of organizational data and frequent legal constraints such as data residency, this is especially limiting. One particularly concerning recent development is the ability to regenerate training data from learned models, which can lead to people unintentionally disclosing secret information.
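To make the cost point concrete, here is a rough back-of-the-envelope sketch. Only the price of 12c per 1,000 tokens comes from the text above; the number of backend calls and the tokens per call are hypothetical values chosen purely for illustration.

```python
# Price quoted above for a fine-tuned Davinci model: $0.12 per 1,000 tokens.
PRICE_PER_1K_TOKENS_USD = 0.12


def interaction_cost(backend_calls: int, tokens_per_call: int,
                     price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Estimate the API cost in USD of one end-user interaction."""
    total_tokens = backend_calls * tokens_per_call
    return total_tokens / 1000 * price_per_1k


# Hypothetical interaction: 4 backend calls of about 3,000 tokens each.
cost = interaction_cost(backend_calls=4, tokens_per_call=3000)
print(f"${cost:.2f} per interaction")  # prints "$1.44 per interaction"
```

Under these assumed numbers a single interaction already costs more than a dollar, and the estimate scales linearly with both the number of calls and the prompt size.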
OK, so how do I run my own?
The LLM space is evolving incredibly quickly. What we are seeing emerge is a particular technology stack that combines HuggingFace, PyTorch, and Ray.
Recent results on Dolly and Vicuna (both trained on Ray, or trained on models built with Ray, like GPT-J) show that small LLMs (relatively speaking; say, the open-source model GPT-J-6B with 6 billion parameters) can be incredibly powerful when fine-tuned on the right data. The key parts are the fine-tuning and the right data. So you do not always need to use the latest and greatest model with 150 billion-plus parameters to get useful results. Let’s get started!
Challenges in Generative AI infra
Generative AI infrastructure presents new challenges for distributed training, online serving, and offline inference workloads.
Distributed training differs from a normal training workflow. It applies to large-scale training scenarios, such as NLP and computer vision models, where the dataset or the model cannot fit on a single machine. The strategy is to distribute both the data and the model across different machines so that they can execute the training job in parallel.
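The idea of splitting both the data and the model can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping involved, not how any particular framework does it: the function names, the interleaved sharding scheme, and the 28-layer model are all assumptions made for the example.

```python
from typing import List


def shard_data(num_samples: int, num_workers: int) -> List[List[int]]:
    """Data parallelism: give each worker an interleaved slice of sample indices."""
    return [list(range(rank, num_samples, num_workers))
            for rank in range(num_workers)]


def partition_layers(num_layers: int, num_devices: int) -> List[range]:
    """Model parallelism: give each device a contiguous block of layers."""
    base, extra = divmod(num_layers, num_devices)
    blocks, start = [], 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if device < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


print(shard_data(10, 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print([(b.start, b.stop - 1) for b in partition_layers(28, 4)])
# [(0, 6), (7, 13), (14, 20), (21, 27)]
```

Real systems also have to balance memory and compute per device and move activations between devices, which is where most of the difficulty lies.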
Common challenges for distributed training for generative models include:
- How to effectively partition the model across multiple accelerators?
- How to set up your training to be tolerant of failures on preemptible instances?
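One common way to tolerate preemption is to checkpoint periodically and resume from the last checkpoint on restart. The sketch below shows that pattern using only the standard library. The JSON checkpoint format, the function names, and the stand-in "training step" are assumptions for illustration; a real job would save model and optimizer state instead.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write the checkpoint atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path: str):
    """Return (step, state), or (0, {}) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path: str, total_steps: int, checkpoint_every: int = 10) -> dict:
    """Resume from the last checkpoint and continue until total_steps."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    train(ckpt, total_steps=15)      # first run, "preempted" after 15 steps
    train(ckpt, total_steps=30)      # the restart resumes from step 15
    print(load_checkpoint(ckpt)[0])  # prints 30
```

The checkpoint interval is a trade-off: frequent checkpoints waste less work after a preemption but add I/O overhead to every run.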
Some of the largest scale generative model training is being done on Ray today:
- OpenAI uses Ray to coordinate the training of ChatGPT and other models.
- The Alpa project uses Ray to coordinate training and serving of data, model, and pipeline-parallel computations with JAX as the underlying framework.
- Cohere and EleutherAI use Ray to train their large language models at scale along with PyTorch and JAX.
Fig. Alpa uses Ray as the underlying substrate to schedule GPUs for distributed training of large models, including generative AI models.
Online serving and fine-tuning
Generative AI requires medium-scale workloads (e.g., 1-100 nodes) for training and fine-tuning. Typically, users at this scale are interested in scaling out existing training or inference workloads they can already run on one node (e.g., using DeepSpeed, Accelerate, or a variety of other common single-node frameworks). In other words, they want to run many copies of a workload for purposes of deploying an online inference, fine-tuning, or training service.
Fig. A100 GPUs, while providing much more GRAM per GPU, cost much more per gigabyte of GPU memory than A10 or T4 GPUs. Multi-node Ray clusters can hence serve generative workloads at a significantly lower cost when GRAM is the bottleneck.
Doing this form of scale-out itself can be incredibly tricky to get right and costly to implement. For example, consider the task of scaling a fine-tuning or online inference service for multi-node language models. There are many details to get right, such as optimizing data movement, fault tolerance, and autoscaling of model replicas. Frameworks such as DeepSpeed and Accelerate handle the sharding of model operators, but not the execution of higher-level applications invoking these models.
However, it is challenging to scale deployments involving many machines. It is also difficult to drive high utilization out of the box.
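As an illustration of one of those details, autoscaling of model replicas, here is a minimal sketch of a load-based scaling rule. The function name, the parameters, and the default limits are hypothetical; production autoscalers also smooth the signal over time and delay scale-down to avoid thrashing.

```python
import math


def desired_replicas(ongoing_requests: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Pick a replica count that keeps per-replica load near the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(ongoing_requests / target_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 7, 40, 500):
    print(load, "->", desired_replicas(load, target_per_replica=5))
# 0 -> 1
# 7 -> 2
# 40 -> 8
# 500 -> 8
```

The upper bound is what keeps utilization and cost under control: beyond it, extra load queues rather than adding GPUs.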
Offline batch inference
On the offline side, batch inference for these models is also challenging because it requires data-intensive preprocessing followed by GPU-intensive model evaluation. Companies like Meta and Google build custom services (DPP, tf.data service) to perform this at scale in heterogeneous CPU/GPU clusters. While such services used to be a rarity, we are increasingly seeing users ask how to do this in the context of generative AI inference. These users now also need to tackle the distributed systems challenges of scheduling, observability, and fault tolerance.
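The two-stage shape of this workload, parallel preprocessing feeding batched model evaluation, can be sketched on a single machine as follows. Everything here is a stand-in: `preprocess` and `model_eval` are hypothetical placeholders, and a thread pool takes the place of a CPU cluster (for CPU-bound Python work, a real system would use processes or a distributed framework instead).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def preprocess(text: str) -> List[str]:
    """CPU-side step: normalize and tokenize (stand-in for real preprocessing)."""
    return text.lower().split()


def model_eval(batch: List[List[str]]) -> List[int]:
    """Stand-in for GPU-side evaluation of a whole batch; returns token counts."""
    return [len(tokens) for tokens in batch]


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_inference(texts: List[str], batch_size: int = 2,
                    cpu_workers: int = 4) -> List[int]:
    """Preprocess in parallel on CPU workers, then evaluate batch by batch."""
    results: List[int] = []
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        # pool.map preserves input order, so outputs line up with inputs.
        for batch in batched(pool.map(preprocess, texts), batch_size):
            results.extend(model_eval(batch))
    return results


print(batch_inference(["Hello Ray", "fine tune a model", "serve"]))
# [2, 4, 1]
```

Keeping the stages separate is what lets a cluster scheduler size the CPU and GPU pools independently, so that expensive accelerators are not left waiting on preprocessing.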
How Katonic addresses these challenges
Distributed processing is the best way to scale machine learning. Apache Spark is an easy default option. Spark is a popular distributed framework; it works well for data processing and “embarrassingly parallel” tasks. For machine learning, however, Ray is a better option.
Ray is an open-source, unified computing framework that simplifies scaling AI and Python workloads. Ray is great for machine learning as it can leverage GPUs and handle distributed data. It includes a set of scalable libraries for distributed training, hyperparameter optimization, and reinforcement learning. In addition, its fine-grained controls let you adjust processing to the workload using distributed actors and in-memory data processing.
Today, Ray is used by leading AI organizations to train large language models (LLM) at scale (e.g., by OpenAI to train ChatGPT, Cohere to train their models, EleutherAI to train GPT-J, and Alpa for multi-node training and serving). However, one of the reasons why these models are so exciting is that open-source versions can be fine-tuned and deployed to address particular problems without needing to be trained from scratch. Indeed, users in the community are increasingly asking how to use Ray for the orchestration of their own generative AI workloads, building off foundation models trained by larger players.
However, getting from point A to a Ray cluster may not be so simple:
- Setup: Setting up the configuration for a cluster can be complex, and the required skillset becomes more demanding as the number of nodes increases.
- Hardware: To deliver its full benefits, Ray may need access to robust infrastructure and GPUs.
- Data access: Efficient data transfer between your storage tools and the Ray cluster requires a seamless, fast connection. However, turning a line drawn between boxes on a whiteboard into working data connectivity can be complex.
- Security and governance: The Ray clusters must satisfy the access control requirements and adhere to the internal data encryption and auditing guidelines.
- Scalability: To avoid overspending on infrastructure, it is important to plan carefully so that clusters can handle increasing workloads without issues.
To use Ray, many companies look to provision and manage dedicated clusters just for Ray jobs. But your team doesn’t have a lot of extra cycles for DevOps, and neither does IT right now. You will also end up paying for that cluster while it sits idle between Ray jobs.
Alternatively, you can subscribe to a Ray service provider. That eliminates the DevOps problem. But you’ll have to copy your data to the provider’s datastore or go through the treacherous process of connecting it to your data. It also means multiple logins and collaboration platforms to manage. You want to use Ray for some of your projects, not all of them.
Katonic offers a cost-effective and secure solution. Katonic 4.0 now supports the open-source Ray framework, which enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT.
This solution involves configuring and orchestrating a Ray cluster directly on the infrastructure that supports the Katonic platform. With Katonic, your users can spin up Ray clusters when needed. Katonic automates the DevOps away; your team can focus on delivering quality work.
The integration with Katonic’s on-demand, auto-scaling compute clusters streamlines the development process while also supporting data preparation via Apache Spark and machine learning and deep learning via XGBoost, TensorFlow, and PyTorch.
That means your Ray clusters can be on-prem or on any major cloud provider without waiting for IT, DevOps, or the cloud provider to catch up with industry innovation. As always with Katonic, your data is connected centrally, and access controls and audit trails are built-in. Best of all, you get a head start on the competition.
We have shown how the Katonic AI Platform combines Ray, HuggingFace, and PyTorch to offer a solution that:
- Makes it simple and quick to deploy a model as a service.
- Can be used to fine-tune cost-effectively, and is in fact most cost-effective when using multiple machines, without the added complexity.
- Lets fine-tuning, even for a single epoch, change the output of a trained model.
- Makes deploying a fine-tuned model only marginally harder than deploying a standard one.
Our upcoming blog post will provide a step-by-step guide on how to efficiently use Hugging Face and Ray in combination with the Katonic MLOps platform. This will enable you to create a system for fine-tuning and serving LLMs, regardless of model size, in under 40 minutes and at a cost of less than $7 for a 6 billion parameter model. Stay tuned!