Omar Mohamed Awad

Senior ML Research Engineer @ Huawei | LLM, Generative AI | ex-Cerebras

Senior AI Researcher, Software Engineer, UofT'20

I’m a Senior AI Researcher at Huawei Technologies R&D Canada, focusing on developing data-efficient methods for training large transformer-based models.
My interests include software design and development, artificial intelligence, machine learning, computer vision, natural language processing, and computer architecture.

M.A.Sc., Electrical and Computer Engineering, UNIVERSITY OF TORONTO

  • Thesis Topic: Exploiting Fine-Grain Sparsity to Accelerate Neural Network Training.
  • Supervisor: Prof. Andreas Moshovos

Chair of Embedded Security, RUHR UNIVERSITY BOCHUM

  • Study Abroad

B.Sc., Electrical and Electronics Engineering, GERMAN UNIVERSITY IN CAIRO

  • Thesis Topic: Implementation of Hardware Trojans in ASIC Chips based on Routing Capacitive Crosstalk.
  • Supervisor: Prof. Christof Paar - Ruhr University Bochum, Germany.

Skills

Below are some of my skills, and I'm always looking to learn more.

Deep Learning

Extensive research optimizing CNN, RNN, and transformer-based models targeting CV, NLP, and speech applications. Research work includes model compression techniques (e.g., low-rank tensor decomposition and layer truncation), knowledge distillation, and data-efficient training (e.g., data sampling and token dropping) of large transformer-based models. Models I've worked on include BERT, CPM, GPT-2/3, ViT, Linformer, Conformer, and Swin Transformer.
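
A minimal sketch of the low-rank decomposition idea mentioned above, assuming a PyTorch nn.Linear layer and an illustrative helper decompose_linear (an example, not the exact method used in my research): the weight matrix is factored with a truncated SVD and the layer is replaced by two smaller ones.

```python
import torch
import torch.nn as nn

def decompose_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a Linear layer with two low-rank Linear layers via truncated SVD."""
    W = layer.weight.data                         # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                  # fold singular values into the left factor
    V_r = Vh[:rank, :]                            # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r.contiguous()
    second.weight.data = U_r.contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: a 1024x1024 projection (~1.05M weights) shrinks to two factors (~131K weights).
layer = nn.Linear(1024, 1024)
compressed = decompose_linear(layer, rank=64)
x = torch.randn(8, 1024)
rel_err = (layer(x) - compressed(x)).norm() / layer(x).norm()
print(f"relative error at rank 64: {rel_err:.3f}")
```

In practice the rank is chosen per layer (or reduced progressively), and the compressed model is fine-tuned, often with knowledge distillation, to recover accuracy.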

TensorFlow/PyTorch

In-depth experience with PyTorch and TensorFlow frameworks through numerous R&D projects.

Python

Experience with Python for programming and scripting.

C/C++

In-depth experience with C/C++ programming, efficient implementation of data structures and algorithms, and use of the C++ STL. Projects I've worked on include building DNNsim, a cycle-accurate simulator for custom ML hardware accelerators.

GPU programming (CUDA)

Hands-on experience with GPU programming using CUDA and code optimization through various projects, including building a Compressed-Memory Sparse DNN Inference Accelerator on GPU.

Perl, TCL, Bash

Hands-on experience with Perl, TCL, Bash scripting through various FPGA and ASIC design projects.

Verilog/SystemVerilog/VHDL

In-depth experience with FPGA and ASIC design using Verilog, SystemVerilog, and VHDL, including designing FPRaker, a novel custom hardware accelerator for ML training (published in MICRO'21).

DevOps Tools (Git, Docker, Conda, JIRA)

Hands-on experience with DevOps tools such as Git, Docker, Conda, and JIRA through various projects.

ASIC Design

Hands-on experience with various ASIC design tools, including Intel Quartus Prime, Xilinx ISE, Synopsys Design Compiler, HSPICE & HSIM, and Cadence SoC Encounter, Innovus & Virtuoso.


Portfolio

Personal Projects

Here you can see some of the open-source projects I've done on my own time.

In my free time, I continue to work on personal projects and have many ideas just waiting to be realized.

To see more of my projects...

Visit My GitHub  

Awards


Winner of Huawei Quarterly Outstanding Contribution to Project Award. [October 2020]

University of Toronto Edward S. Rogers Sr. Graduate Scholarship for 2 years. [2019 & 2020]

Ruhr University Bochum Undergraduate Research Award for 1 year. [2017]

German University in Cairo High School Excellence Scholarship for 5 years. [2013-2018]

Papers


SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection

F. Ataiefard, W. Ahmed, H. Hajimolahoseini, S. Asani, F. Javadi, M. Hassanpour, O. Mohamed Awad, A. Wen, K. Liu, Y. Liu

Association for the Advancement of Artificial Intelligence (AAAI 2024)

SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling

H. Hajimolahoseini, O. Mohamed Awad, W. Ahmed, A. Wen, S. Asani, M. Hassanpour, F. Javadi, M. Ahmadi, F. Ataiefard, K. Liu, Y. Liu

37th Conference on Neural Information Processing Systems (NeurIPS 2023)

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

F. Javadi, W. Ahmed, H. Hajimolahoseini, F. Ataiefard, M. Hassanpour, S. Asani, A. Wen, O. Mohamed Awad, K. Liu, Y. Liu

37th Conference on Neural Information Processing Systems (NeurIPS 2023)

cuSCNN: An Efficient CUDA Implementation of Sparse CNNs

M. Elgammal, O. Mohamed Awad, I. Edo, A. Moshovos, V. Betz

HEART ’23: Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies

Compressing Pre-trained Language Models using Progressive Low Rank Decomposition

H. Hajimolahoseini, M. Rezagholizadeh, V. Partovinia, M. Tahaei, O. Mohamed Awad, Y. Liu

35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks · Dec 6, 2021

Improving ResNet-9 Generalization Trained on Small Datasets

O. Mohamed Awad, H. Hajimolahoseini, M. Lim, G. Gosal, W. Ahmed, Y. Liu, G. Deng

Hardware Aware Efficient Training (HAET) at ICLR 2021
(Winning paper of the Hardware Aware Efficient Training competition at ICLR 2021)

FPRaker: A Processing Element for Accelerating Neural Network Training

O. Mohamed Awad, M. Mahmoud, I. Edo, A. Hadi Zadeh, C. Bannon, A. Moshovos

54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021. [Acceptance Rate: 21%]

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

A. Hadi Zadeh, I. Edo, O. Mohamed Awad, A. Moshovos

53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate: 19%]

TensorDash: Exploiting Sparsity to Accelerate Neural Network Training

M. Mahmoud, I. Edo, A. Hadi Zadeh, O. Mohamed Awad, J. Albericio, A. Moshovos

53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate: 19%]

ShapeShifter: Enabling Fine-Grain Data Width Adaptation in Deep Learning

A. Delmás, S. Sharify, I. Edo, D. Malone Stuart, O. Mohamed Awad, P. Judd, M. Mahmoud, M. Nikolic, K. Siu, Z. Poulos, and A. Moshovos

52nd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. [Acceptance Rate: 23%]

Security Implications of Intentional Capacitive Crosstalk

C. Kison, O. Mohamed Awad, M. Fyrbiak, C. Paar

IEEE Transactions on Information Forensics and Security, 2019

Timeline

A short summary of my work experience.

  • Huawei
    Oct 2022 - Present
    Senior ML Research Engineer
    Huawei Technologies Canada Co., Ltd., Toronto, Canada

    - Researching data-efficient training algorithms that provide significant time-to-accuracy savings in pretraining large transformer-based models.
    - Working on a data-efficient training library that provides various out-of-the-box algorithms to accelerate the training of arbitrary transformer-based models. Evaluated models include BERT, RoBERTa, Conformer, T5, GPT-3, LLaMA, BLOOM, and ViT.

  • Cerebras
    Aug 2021 - Oct 2022
    Member of Technical Staff - Deep Learning Performance
    Cerebras Systems, Toronto, Canada

    - Analysis, debugging, and tuning of end-to-end performance (from the model implementation in TensorFlow/PyTorch all the way down to microcode running on chip) of deep learning models on the Cerebras CS-2 wafer-scale system.
    - Performance modeling/projection of upcoming models (e.g., Vision Transformer, Linformer) and kernels (e.g., attention) to be supported.

  • Huawei
    Aug 2020 - Aug 2021
    Machine Learning Research Engineer
    Huawei Technologies Canada Co., Ltd., Toronto, Canada

    - Optimized the training performance of various state-of-the-art NLP models (BERT, CPM, GPT-2/3) on Huawei’s Ascend 910 AI training server.
    - Kernel development and performance optimization for Huawei’s Ascend 910 AI training server.
    - Researched model compression techniques, e.g., low-rank tensor decomposition and layer truncation.
    - Researched knowledge distillation techniques to improve the accuracy of compressed models.

  • UofT
    Sep 2018 - Jul 2020
    Graduate Research Assistant
    University of Toronto, Toronto, Canada

    - Design of a neural network training accelerator based on a novel processing element architecture that exploits fine-grain unstructured sparsity to increase the performance and energy efficiency of the training process by 1.47× and 1.39×, respectively on average over the studied models.
    - Development of a custom cycle-accurate trace-based simulator (C/C++) to model the execution time and memory access of the proposed accelerator compared to a baseline value-agnostic accelerator.
    - Exploiting the narrow floating-point value distribution during training through exponent base-delta encoding compression to save off-chip memory bandwidth by 30% on average (a simplified sketch of this encoding follows below).
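
As a rough illustration of the exponent base-delta idea in the last item above, here is a simplified software sketch (not the actual hardware compressor); the base_delta_encode helper and its group_size are assumptions made for the example. It stores one base exponent per group of float32 values and keeps only a narrow delta per value.

```python
import numpy as np

def base_delta_encode(values: np.ndarray, group_size: int = 16):
    """Encode float32 exponents as one per-group base plus small per-value deltas."""
    bits = values.astype(np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF                # 8-bit biased exponents
    encoded = []
    for start in range(0, len(exponents), group_size):
        group = exponents[start:start + group_size]
        base = int(group.min())                    # shared base exponent for the group
        deltas = group - base                      # small non-negative deltas
        delta_bits = max(int(deltas.max()).bit_length(), 1)
        encoded.append((base, delta_bits, deltas))
    return encoded

# Values seen during training tend to cluster in a narrow exponent range, so a few
# delta bits per value replace the full 8-bit exponent, reducing memory traffic.
activations = 0.01 * np.random.randn(64).astype(np.float32)
for base, delta_bits, _ in base_delta_encode(activations)[:2]:
    print(f"base exponent: {base}, bits per delta: {delta_bits}")
```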

Contact Me

Email: omar.mo.awad@outlook.com

or send me a message over LinkedIn!