Foundational and Use-Inspired AI Research Initiative
The research initiative focuses on enhancing healthcare, strengthening national defense and security, and leveraging cloud computing. Building on the PIs’ prior and ongoing research on AI integration, the research themes listed here address multiple aspects of emerging fields aligned with two societal drivers: "Enhancing Healthcare and Quality of Life" and "Transform National Defense and Security". They are also in line with the encouraged collaboration and partnerships between commercial cloud computing platforms and the federally funded AI Research and Development (R&D) community, which aim to advance AI research and education.
Trustworthy Representation Learning for Medical Imaging Report Generation
This work aims to enhance the reliability of automatic medical imaging report generation by developing a robust model that learns a more dependable feature space. To accomplish this goal, we propose using contrastive learning to learn the latent space embeddings, where similar images are pulled closer together and dissimilar images are pushed farther apart. By capturing the relationships between training samples, the model produces a more stable feature embedding, which can then be fed to a language model for report generation, improving robustness and yielding more trustworthy results. MIMIC-CXR [14], a large dataset containing over 300K chest X-ray images with corresponding imaging reports, will be used in this study. The contrastive learning model is implemented by a global cross-modal matching approach with a two-branch network that processes text and images separately.
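As an illustration of the cross-modal matching idea, the following is a minimal sketch of a symmetric contrastive loss over paired image and report embeddings. It assumes an InfoNCE-style objective with a temperature parameter, which is one common choice and not necessarily the loss used in this work; the function names and the `temperature` default are hypothetical, and the embeddings stand in for the outputs of the two branches.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed objective, for illustration).

    Row k of image_embs and row k of text_embs form a matching pair;
    every other row in the batch serves as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Pairwise cosine similarities, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each image should pick out its own report.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each report should pick out its own image.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With correctly paired embeddings the loss is close to zero, and it grows when images are paired with the wrong reports, which is the signal that shapes the shared feature space.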
Secure Code Analysis Using AI
This research aims to apply AI techniques to static code analysis in order to detect vulnerabilities in code, combining static analysis, ML, and Taint Track Visualization. Using ML to reduce false positives and false negatives (false warnings) enables developers to improve vulnerability detection accuracy and focus on genuine vulnerabilities. However, training ML models for secure code analysis faces several challenges that require solutions:
- Limited datasets: ML models need large datasets of code. A vast dataset of open-source functions, carefully labeled with findings, can supplement existing vulnerability datasets.
- Need for new supervised models: Developing new models can capture new types of vulnerabilities. Deep feature representation learning on source code also shows promise, but it requires significant computational resources [23, 24].
- Tainted datasets: Ensuring datasets used for training are untainted is important, and dataset integrity is necessary. Searching and downloading datasets from platforms like GitHub can introduce tainted data.
- False Warnings: Static analysis tools produce false positives and false negatives.
Efficient Cross-Language Malware Detection for Cloud Platforms
This research proposes an efficient malware scanner that substantially improves detection coverage and can be deployed as a module in the cloud platform to scan any uploaded software before it runs. Building on our preliminary work on graph modeling [8], we propose a code similarity-based method, as this approach can achieve high detection coverage [9-14]. We plan to achieve our goal through three research tasks, which are shown in Figure 4.
Task 1: Embed Code to Graph with ECG: A Language-Agnostic Code Representation. The major challenge of cross-language code similarity detection is that syntax and semantics differ between languages. This task proposes a language-agnostic code representation, the embedded control flow graph (ECG), to unify code from different languages into the same representation.
Task 2: Efficient Code Similarity Detection for Embedded Control Flow Graph. After Task 1, the problem of code similarity becomes one of ECG similarity. To compute it efficiently, this task proposes to leverage a graph triplet-loss network to compute the similarity between ECGs.
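The training objective behind a triplet-loss network can be sketched as follows. This is a minimal illustration that assumes each ECG has already been mapped to a fixed-length embedding vector and that Euclidean distance with a margin of 1.0 is used; the graph network itself is omitted, and the function names and margin value are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on precomputed embeddings.

    The loss is zero once the anchor is closer to the positive (similar code)
    than to the negative (dissimilar code) by at least the margin.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)
```

Minimizing this quantity over many (anchor, positive, negative) triplets encourages embeddings of similar ECGs to cluster, so that similarity can later be read off as a distance between vectors.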
Task 3: Scalable Search Algorithm for Malware Detection. After Tasks 1 & 2, each code sample is represented as an embedding, so the similarity of two code samples can be computed as the similarity of their embeddings. Given an unknown code sample, we can compare it against a pre-built malware database (DB); the sample is highly suspicious if it is similar to known malware. To make this comparison scale, this task proposes a search method based on locality-sensitive hashing.
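One way to picture a locality-sensitive hashing search is the random-hyperplane scheme sketched below, in which embeddings pointing in similar directions tend to receive the same bit signature and land in the same bucket. This is an assumed variant chosen for illustration, not necessarily the scheme this task will adopt; the class name, the `num_bits` and `seed` parameters, and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy random-hyperplane LSH index over embedding vectors."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the signature.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        # A bit is 1 when the vector lies on the non-negative side of a plane.
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in self.planes
        )

    def add(self, label, vec):
        """Store a known sample under its signature."""
        self.buckets[self._signature(vec)].append((label, vec))

    def query(self, vec):
        """Return labels of stored samples sharing the query's bucket."""
        return [label for label, _ in self.buckets.get(self._signature(vec), [])]


index = HyperplaneLSH(dim=3)
index.add("known_malware_a", [1.0, 2.0, 3.0])
candidates = index.query([2.0, 4.0, 6.0])  # same direction, same bucket
```

Only the candidates returned from the matching bucket need an exact similarity check, which is what keeps lookup cost low as the database grows. A practical index would typically use several hash tables to reduce missed matches.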
Performance Analysis on ML Model Deployment with Large-Scale Data on Cloud
The goal of this study is twofold: (1) to conduct a comprehensive performance analysis of training and prediction for ML models on large-scale datasets in the cloud (e.g., real-time transactional & analytical data, medical images, and/or the datasets indicated in research themes 1 & 3 above), considering two approaches: (a) executing built-in query functions directly from a database to an AI platform, and (b) implementing the same tasks within a typical AI-ML platform; and (2) to analyze the relationship between performance and cost, while investigating any tradeoffs.
The study conducts performance evaluation and analysis with two pipelines and examines their relationship with cost. Pipeline 1: Execute built-in functions directly on the dataset in the database to train a model and deploy it to the AI endpoint. To improve throughput, the tests will send multiple rows in a single request by grouping them into batches. Pipeline 2: Programmatically transfer the same data to a Cloud Storage within the AI platform and perform the same training and deployment tasks. Each run of both pipelines will be performed using various data types (e.g., image, video, & text) and different batch sizes.
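The batching step and a simple throughput measurement could be sketched as below. This is a platform-neutral illustration only: `send_request` is a hypothetical callable standing in for whichever prediction endpoint client is used, and the helper names and returned fields are assumptions, not part of any cloud SDK.

```python
import time


def make_batches(rows, batch_size):
    """Group rows into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def measure_throughput(rows, batch_size, send_request):
    """Send every batch through send_request and report simple run statistics.

    send_request is a placeholder for the real endpoint call (an assumption).
    """
    start = time.perf_counter()
    requests = 0
    for batch in make_batches(rows, batch_size):
        send_request(batch)
        requests += 1
    elapsed = time.perf_counter() - start
    rate = len(rows) / elapsed if elapsed > 0 else float("inf")
    return {"requests": requests, "seconds": elapsed, "rows_per_second": rate}
```

Running the same rows through `measure_throughput` with several batch sizes gives one data point per configuration, which can then be set against the cost of each run.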