
TL;DR
- The AI development process is based on two fundamental components: base models and training data.
- These components are expensive and resource-intensive to create from scratch, making open-source repositories the main, and sometimes only, option for companies wishing to adopt AI cost-effectively.
- Using models from open-source repositories exposes you to supply chain attacks that could lead to arbitrary code execution, sensitive data exfiltration, or other unauthorized actions in your environment.
- Model scanning detects compromised AI assets from open-source repositories and mitigates the risks mentioned above.
- To gain full confidence in your AI components, embed model scanning in the main stages of your CI/CD pipelines.
Open Source Dependencies In The AI Life-Cycle
A Brief Introduction To The AI Development Process
The AI development and deployment process consists of three main stages:
- Choosing the foundational components for your project.
- Performing training or fine-tuning in the most cost-effective manner.
- Serving and deploying your model for real-time usage.
The first stage is the one we will dive into. Unlike other development processes, AI product development relies heavily on open-source components, which form the foundation of the final product. These components are mostly base models: models pre-trained to perform certain tasks, which you can fine-tune to match a more specific use case.
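As a minimal sketch of this first stage, the snippet below pulls a pre-trained base model as a starting point for fine-tuning. It assumes the Hugging Face transformers library; bert-base-uncased and the binary-classification head are purely illustrative choices.

```python
# Illustrative first stage: fetch a pre-trained base model from an
# open-source repository (the Hugging Face Hub) as a starting point
# for task-specific fine-tuning.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "bert-base-uncased" is only an example checkpoint; any base model
# suited to your task could take its place.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # e.g. a binary classifier after fine-tuning
)

# Note: this download is exactly the step where the supply chain risk
# discussed below enters your environment.
```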
Why Use Open-Source Components?
Creating a well-adjusted model from scratch for a specific need is highly challenging.
Training a model from scratch requires deep expertise in data science and machine learning, as well as extensive resources, including large datasets and computational power for iterative training.
For those reasons, companies that are not “AI-native” like OpenAI, or tech giants such as AWS or Meta, will often turn to open-source repositories for AI assets and use ready-made models and datasets that require only a modest amount of additional fine-tuning to match the company’s use case.
The Risks of Open-Source AI
Using AI assets from open-source repositories has advantages but also poses risks, including exposure to supply chain attacks.
Like code packages and other resources, AI models and training datasets uploaded to the internet for public consumption can be compromised or poisoned to contain vulnerabilities, backdoors or malicious code. For example, a model file can embed calls to dangerous library functions that execute the moment the file is loaded.
This is especially relevant for model files, as many popular formats (PyTorch, joblib, scikit-learn, etc.) build on Python’s Pickle format, which is known to allow arbitrary code execution during deserialization.
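To make the risk concrete, here is a minimal, self-contained illustration of how a pickled object can smuggle in code that runs on load; the echoed message stands in for an attacker’s real command.

```python
import os
import pickle

# Pickle lets any object dictate how it is reconstructed. A poisoned
# "model" can abuse __reduce__ to embed a call to os.system (or any
# other callable) that fires the moment the file is deserialized.
class MaliciousPayload:
    def __reduce__(self):
        # Harmless stand-in for an attacker's command.
        return (os.system, ("echo 'arbitrary code executed on load'",))

poisoned_blob = pickle.dumps(MaliciousPayload())

# The victim only has to *load* the file - no further interaction needed.
pickle.loads(poisoned_blob)  # runs the embedded command
```

A real attack would hide a payload like this inside an otherwise functional model file, so nothing looks wrong to the user.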
Model Scanning
Mitigating the risks of open-source AI while enabling adoption and productivity requires a security solution, and this is where model scanning comes in. Model scanning is an umbrella term for products designed to detect compromises that allow a model to run arbitrary code, exfiltrate sensitive data, or perform any other unauthorized action during the model’s loading, training or inference.
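At its core, a scanner inspects a model file without executing it. The toy sketch below, built only on Python’s standard pickletools module, walks a pickle stream’s opcodes and flags imports of suspicious modules. It is a simplified heuristic, not a substitute for a production scanner, and the denylist is illustrative.

```python
import pickletools

# Modules whose appearance inside a pickle stream is a strong red flag.
# An illustrative denylist; real scanners use far richer rules.
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "socket"}

def scan_pickle(data: bytes) -> list[str]:
    """Statically walk a pickle stream's opcodes (without executing it)
    and return any suspicious global imports found."""
    findings = []
    strings = []  # naive tracking of string pushes, for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            strings.append(arg)
        elif opcode.name in ("GLOBAL", "INST"):
            module = str(arg).split(" ", 1)[0]  # arg is "module name"
            if module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append(str(arg))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            module, name = strings[-2], strings[-1]
            if module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append(f"{module}.{name}")
    return findings
```

Running scan_pickle over the poisoned_blob from the previous snippet reports the embedded posix.system import (nt.system on Windows) before any code has a chance to run.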
To gain full confidence that none of your AI assets have been compromised, I believe that a model scanner should be used both manually and automatically as part of a company’s CI/CD process, in two key steps of the AI life-cycle (a minimal gate sketch follows this list):
- When downloading a model from an open-source repository, before any action with the model takes place in your environment (loading, training, etc.).
- Before deploying the model and making it accessible to the desired users.
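As an illustration of such a gate, the sketch below blocks a pipeline stage unless a scan passes. It assumes the open-source picklescan CLI (installable via pip install picklescan) and treats any nonzero exit code as a failure; check the exact flags and exit-code conventions against whichever scanner you adopt, and note that artifacts/model.pt is a hypothetical path.

```python
#!/usr/bin/env python3
# Minimal CI gate: refuse to continue unless the model artifact passes a scan.
# Swap in any model scanner that reports findings through its exit code.
import subprocess
import sys

ARTIFACT = "artifacts/model.pt"  # hypothetical output of an earlier pipeline step

result = subprocess.run(["picklescan", "--path", ARTIFACT])
if result.returncode != 0:
    # Fail the CI job so the flagged model is never loaded or deployed.
    print(f"Model scan flagged {ARTIFACT}; blocking this stage.", file=sys.stderr)
    sys.exit(1)

print(f"{ARTIFACT} passed the scan; continuing the pipeline.")
```

Running the same gate both right after download and right before deployment covers the two steps listed above.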

Summary
As the world of AI continues to grow and becomes ever more relevant, adopting AI into business processes is increasingly a given. Using open-source components as part of your AI development and deployment process is unavoidable, but it shouldn’t be irresponsible. Implement model scanning to ensure your environment and your AI apps remain safe.