Model-less Inference Serving
INFaaS is an inference-as-a-service platform that makes inference accessible and easy to use by abstracting resource management and model selection. Users simply specify their inference task along with any performance and accuracy requirements for queries.
INFaaS runs on AWS (with other provider platform support coming soon).
There are a few AWS-specific setup steps, all of which can be accomplished from the AWS dashboard:
Create an IAM Role with the AmazonEC2FullAccess and AmazonS3FullAccess policies attached.
Launch an m5.2xlarge instance to serve as the INFaaS master node.
We provide a public AMI (ami-036de08e2e59b4abc) in us-west-2 that contains the pre-installed dependencies; you can copy it to your own region.
The instance should have the IAM Role and Security Group you created in the One-time Setup attached to it.
Clone the INFaaS repository: git clone https://github.com/stanford-mast/INFaaS.git
Open start_infaas.sh and fill in the following entries. Entries between <> must be filled in prior to using INFaaS; the rest are set to defaults, which can be changed based on your desired configuration.
###### UPDATE THESE VALUES BEFORE RUNNING ######
REGION='<REGION>'
ZONE='<ZONE>'
SECURITY_GROUP='<SECURITYGROUP>'
IAM_ROLE='<IAMROLE>'
MODELDB='<MYMODELDB>' # Model repository bucket (do not include s3://)
CONFIGDB='<MYCONFIGDB>' # Configuration bucket (do not include s3://)
WORKER_IMAGE='ami-<INFAASAMI>'
NUM_INIT_CPU_WORKERS=1
NUM_INIT_GPU_WORKERS=0
NUM_INIT_INFERENTIA_WORKERS=0
MAX_CPU_WORKERS=1
MAX_GPU_WORKERS=0
MAX_INFERENTIA_WORKERS=0
SLACK_GPU=0 # Used for making popular GPU variants exclusive, set to 0 for no GPU to be used as exclusive
KEY_NAME='worker_key'
MACHINE_TYPE_GPU='p3.2xlarge'
MACHINE_TYPE_CPU='m5.2xlarge'
MACHINE_TYPE_INFERENTIA='inf1.2xlarge'
DELETE_MACHINES='2' # 0: VM daemon stops machines; 1: VM daemon deletes machines; 2: VM daemon persists machines, but removes them from INFaaS's view
Note: If you would like to run the example below, you can either set CONFIGDB to be infaas-sample-public/configs or copy its contents over to your own configuration bucket using the AWS CLI:
aws s3 sync s3://infaas-sample-public/ s3://your-config-bucket/ --exclude "resnet*"
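Before starting INFaaS, it can help to confirm that no entry between <> was left unfilled. The following is a minimal sketch, assuming the entries keep the NAME='value' layout shown above; check_placeholders is a hypothetical helper and is not part of INFaaS.

```shell
# Sketch: list entries in a config script that still contain a <PLACEHOLDER>.
# check_placeholders is a hypothetical helper, not an INFaaS tool.
check_placeholders() {
  local file="$1"
  local unfilled
  # Match assignments such as REGION='<REGION>' or WORKER_IMAGE='ami-<INFAASAMI>'
  # and keep only the entry name (the text before the '=').
  unfilled=$(grep -E "^[A-Z_]+='[^']*<[A-Z_]+>[^']*'" "$file" | cut -d= -f1 || true)
  if [ -n "$unfilled" ]; then
    # Left unquoted on purpose so the names are joined on one line.
    echo "Unfilled entries:" $unfilled
    return 1
  fi
  echo "All required entries are filled in."
}
```

For example, running check_placeholders start_infaas.sh from the INFaaS home directory prints the names of any entries that still need a value and returns a non-zero status.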
Run ./start_infaas.sh from the INFaaS home directory (i.e., the directory in which start_infaas.sh is located).
This will set up all INFaaS components and initial workers, as well as run some basic tests to check that the system is properly set up.
All executables can be found in build/bin.
Currently, users must profile their models and generate a configuration file that can be passed to infaas_modelregistration.
We plan to make this process more automated in the future, but for now:
Go to src/profiler and run:
./profile_model.sh <frozen-model-path> <accuracy> <dataset> <task> [cpus]
The script is interactive and will prompt you for information needed to profile your model.
Once complete, it will output a configuration (.config) file.
Upload this configuration file to the configuration bucket you created in the One-time Setup. Here is how you would do this with the AWS CLI:
aws s3 cp mymodel.config s3://your-config-bucket/mymodel.config
Pass the configuration file to infaas_modelregistration, with the model's bucket path as the second parameter (as in the example below).
Example
In infaas-sample-public, we have provided a CPU TensorFlow model, an equivalent TensorRT model optimized for batch-4, and an equivalent Inferentia model optimized for batch-1 on a single Inferentia core. We also provide their respective configuration files, which were generated as specified above using src/profiler/profile_model.sh. Register these models as follows:
./infaas_modelregistration resnet_v1_50_4.config infaas-sample-public/resnet_v1_50_4/
./infaas_modelregistration resnet50_tensorflow-cpu_4.config infaas-sample-public/resnet50_tensorflow-cpu_4/
./infaas_modelregistration resnet50_inferentia_1_1.config infaas-sample-public/resnet50_inferentia_1_1/
If INFaaS is set up correctly, all of these commands should output a SUCCEEDED message.
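When registering several variants, the three commands above can be wrapped in a loop that stops at the first variant that does not report SUCCEEDED. This is a sketch only: register_all and REGISTER_CMD are hypothetical names, and it assumes the word SUCCEEDED appears somewhere in the command's output, as described above.

```shell
# Sketch: register several model-variants and stop at the first failure.
# REGISTER_CMD and register_all are hypothetical names; REGISTER_CMD can be
# pointed at a stub when trying the loop without a running INFaaS master.
REGISTER_CMD="${REGISTER_CMD:-./infaas_modelregistration}"

register_all() {
  local bucket="$1"; shift
  local variant out
  for variant in "$@"; do
    # Same argument layout as the commands above:
    #   <variant>.config <bucket>/<variant>/
    out=$("$REGISTER_CMD" "${variant}.config" "${bucket}/${variant}/" 2>&1) || true
    case "$out" in
      *SUCCEEDED*) echo "Registered ${variant}" ;;
      *) echo "Registration failed for ${variant}"; return 1 ;;
    esac
  done
}
```

From build/bin, the sample models could then be registered with register_all infaas-sample-public resnet_v1_50_4 resnet50_tensorflow-cpu_4 resnet50_inferentia_1_1.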
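Information about registered models:
To list the model architectures registered for a given task and dataset, use infaas_modarch.
To list the model-variants registered for a given model architecture, use infaas_modinfo.
Example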
To see information about the models you registered in the Model Registration example, run ./infaas_modarch classification imagenet, which should show that resnet50 is the only registered model architecture.
Running ./infaas_modinfo resnet50 should show the three model-variants you registered: resnet_v1_50_4, resnet50_tensorflow-cpu_4, resnet50_inferentia_1_1.
Running queries:
Online query: use infaas_online_query.
Running this with no parameters describes the valid input configurations (corresponding to the model-less abstraction, which you can read more about in the second reference paper below).
INFaaS returns the raw output from the model (e.g., output probabilities for each class).
Offline query: use infaas_offline_query.
INFaaS returns whether the job scheduling was successful.
If successfully scheduled, the job can be monitored by checking the output_url bucket.
Example
Note: to run this example, you must have called ./start_infaas.sh with at least one GPU worker (i.e., NUM_INIT_GPU_WORKERS >= 1 and MAX_GPU_WORKERS >= NUM_INIT_GPU_WORKERS) and one Inferentia worker (i.e., NUM_INIT_INFERENTIA_WORKERS >= 1 and MAX_INFERENTIA_WORKERS >= NUM_INIT_INFERENTIA_WORKERS).
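The worker conditions in this note can be checked against the values in start_infaas.sh. The sketch below assumes the file still holds the values INFaaS was started with and that each count is written as NAME=number; read_entry and check_workers are hypothetical helpers.

```shell
# Sketch: confirm that at least one GPU and one Inferentia worker are configured.
# read_entry and check_workers are hypothetical helpers, not INFaaS tools.
read_entry() {
  # Print the numeric value assigned to entry $2 in file $1
  # (e.g. a line such as MAX_GPU_WORKERS=1).
  sed -n "s/^$2=\([0-9][0-9]*\).*/\1/p" "$1" | head -n 1
}

check_workers() {
  local file="$1" kind init max
  for kind in GPU INFERENTIA; do
    init=$(read_entry "$file" "NUM_INIT_${kind}_WORKERS")
    max=$(read_entry "$file" "MAX_${kind}_WORKERS")
    # Missing entries are treated as 0, which fails the check.
    if [ "${init:-0}" -lt 1 ] || [ "${max:-0}" -lt "${init:-0}" ]; then
      echo "${kind}: need NUM_INIT >= 1 and MAX >= NUM_INIT (got init=${init:-unset}, max=${max:-unset})"
      return 1
    fi
  done
  echo "GPU and Inferentia workers are configured."
}
```

With the default values shown in the One-time Setup (zero GPU and Inferentia workers), check_workers start_infaas.sh reports the GPU entries and returns a non-zero status.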
Let’s send an online image classification query to INFaaS and specify the model architecture with a latency constraint. After you have registered the three ResNet50 models from the above Model Registration example, we can first send the request with a relaxed latency constraint (assuming you are running in build/bin for the image path to work):
./infaas_online_query -d 224 -i ../../data/mug_224.jpg -a resnet50 -l 300
The first time you run the query, the latency will be on the order of seconds, since the model needs to be loaded before it can be run. If you rerun the query, it should complete much faster (in hundreds of milliseconds). INFaaS uses resnet50_tensorflow-cpu_4 to service this query since it is sufficient to meet the latency requirement.
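To see the difference between the first (cold) run and a rerun, the query can be timed from the client side. This is a rough sketch: time_runs is a hypothetical helper, it relies on GNU date's %N format (available on Linux), and it measures wall-clock time including process start-up rather than any latency reported by INFaaS.

```shell
# Sketch: run a command several times and print each run's wall-clock time in ms.
# time_runs is a hypothetical helper; it assumes GNU date (for %N nanoseconds).
time_runs() {
  local runs="$1"; shift
  local i start end
  for i in $(seq 1 "$runs"); do
    start=$(date +%s%N)
    # The command's own output is discarded; only the timing is printed.
    "$@" > /dev/null 2>&1 || true
    end=$(date +%s%N)
    echo "run ${i}: $(( (end - start) / 1000000 )) ms"
  done
}
```

For example, time_runs 2 ./infaas_online_query -d 224 -i ../../data/mug_224.jpg -a resnet50 -l 300 prints one line per run, which makes the model-loading cost of the first run visible.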
Now, let’s send a query with a stricter latency requirement:
./infaas_online_query -d 224 -i ../../data/mug_224.jpg -a resnet50 -l 50
Again, the first time you run this query, the latency will be on the order of seconds (or you may even get a Deadline Exceeded message if it’s longer than 10 seconds). Inferentia models can take longer to load and set up, which INFaaS accounts for in its scaling algorithm. If you rerun the query, it should complete in milliseconds. INFaaS uses resnet50_inferentia_1_1 to service this query, since, despite being loaded, resnet50_tensorflow-cpu_4 cannot meet the performance requirements you specified.
Finally, let’s send a batch-2 query with a strict latency requirement:
./infaas_online_query -d 224 -i ../../data/mug_224.jpg -i ../../data/mug_224.jpg -a resnet50 -l 20
Again, the first time you run this query, the latency will be on the order of seconds (or you may even get a Deadline Exceeded message if it’s longer than 10 seconds). Similar to Inferentia models, GPU models take longer to load and set up. If you rerun the query, it should complete in milliseconds. INFaaS uses resnet_v1_50_4 to service this query, since (a) resnet50_tensorflow-cpu_4 supports the batch size but not the latency requirement, and (b) resnet50_inferentia_1_1 only supports batch-1 and cannot meet the latency requirement.
You can also simply specify a use-case to INFaaS with a latency and accuracy requirement. For example:
./infaas_online_query -d 224 -i ../../data/mug_224.jpg -t classification -D imagenet -A 70 -l 50
Update the following two parameters in shutdown_infaas.sh:
REGION='<REGION>'
ZONE='<ZONE>'
Then, run ./shutdown_infaas.sh.
You will be prompted on whether you would like to delete or shut down existing worker nodes.
Once this completes, all running INFaaS processes will be shut down on the master, and the workers will be shut down or deleted (depending on your choice).
To file a bug, ask a question, or request a feature, please file a GitHub issue. Pull requests are welcome.
For details about INFaaS, please refer to the following two papers. We kindly ask that you cite them should you use INFaaS in your work.