Senior AI Researcher, Software Engineer, UofT'20
I’m a Senior AI Researcher at Huawei Technologies R&D, Canada, focusing on developing data-efficient methods for large transformer-based models.
My interests include software design and development, artificial intelligence, machine learning, computer vision, natural language processing, and computer architecture.
M.A.Sc., Electrical and Computer Engineering, UNIVERSITY OF TORONTO
Chair of Embedded Security, RUHR UNIVERSITY BOCHUM
B.Sc., Electrical and Electronics Engineering, GERMAN UNIVERSITY IN CAIRO
Below are some of my skills, and I'm always looking to learn more.
Extensive research optimizing CNN, RNN, and transformer-based models targeting CV, NLP, and speech applications. Research work includes model compression techniques (e.g., low-rank tensor decomposition and layer truncation), knowledge distillation, and data-efficient training (e.g., data sampling and token dropping) of large transformer-based models. Models I've worked on include BERT, CPM, GPT-2/3, ViT, Linformer, Conformer, and Swin Transformer.
In-depth experience with PyTorch and TensorFlow frameworks through numerous R&D projects.
Experience with Python for programming and scripting.
In-depth experience with C/C++ programming, efficient implementation of data structures and algorithms, and the use of the STL. Projects I've worked on include building DNNsim, a cycle-accurate simulator for custom ML hardware accelerators.
Hands-on experience with GPU programming using CUDA and code optimization through various projects, including building a Compressed-Memory Sparse DNN Inference Accelerator on GPU.
Hands-on experience with Perl, Tcl, and Bash scripting through various FPGA and ASIC design projects.
In-depth experience with FPGA and ASIC design using Verilog, SystemVerilog, and VHDL, including designing FPRaker, a novel custom ML training hardware accelerator (published in MICRO'21).
Hands-on experience with DevOps tools such as Git, Docker, Conda, and Jira through various projects.
Hands-on experience with different ASIC design tools such as Intel Quartus Prime, Xilinx ISE, Synopsys Design Compiler, HSPICE & HSIM, Cadence SoC Encounter, Innovus & Virtuoso.
Here you can see some of the open-source projects I've built in my own time.
In my free time, I continue to work on personal projects and have many ideas just waiting to be realized.
Winner of Huawei Quarterly Outstanding Contribution to Project Award. [October 2020]
University of Toronto Edward S. Rogers Sr. Graduate Scholarship for 2 years. [2019 & 2020]
Ruhr University Bochum Undergraduate Research Award for 1 year. [2017]
German University in Cairo High School Excellence Scholarship for 5 years. [2013-2018]
Omar Mohamed Awad. 2019
F. Ataiefard, W. Ahmed, H. Hajimolahoseini, S. Asani, F. Javadi, M. Hassanpour, O. Mohamed Awad, A. Wen, K. Liu, Y. Liu
Association for the Advancement of Artificial Intelligence (AAAI 2024)
H. Hajimolahoseini, O. Mohamed Awad, W. Ahmed, A. Wen, S. Asani, M. Hassanpour, F. Javadi, M. Ahmadi, F. Ataiefard, K. Liu, Y. Liu
37th Conference on Neural Information Processing Systems (NeurIPS 2023)
F. Javadi, W. Ahmed, H. Hajimolahoseini, F. Ataiefard, M. Hassanpour, S. Asani, A. Wen, O. Mohamed Awad, K. Liu, Y. Liu
37th Conference on Neural Information Processing Systems (NeurIPS 2023)
M. Elgammal, O. Mohamed Awad, I. Edo, A. Moshovos, V. Betz
HEART ’23 : Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
H. Hajimolahoseini, M. Rezagholizadeh, V. Partovinia, M. Tahaei, O. Mohamed Awad, Y. Liu
35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks · Dec 6, 2021
O. Mohamed Awad, H. Hajimolahoseini, M. Lim, G. Gosal, W. Ahmed, Y. Liu, G. Deng
Hardware Aware Efficient Training (HAET) at ICLR 2021
(Winning paper of the Hardware Aware Efficient Training competition at ICLR 2021)
O. Mohamed Awad, M. Mahmoud, I. Edo, A. Hadi Zadeh, C. Bannon, A. Moshovos
54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021. [Acceptance Rate : 21%]
A. Hadi Zadeh, I. Edo, O. Mohamed Awad, A. Moshovos
53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate : 19%]
M. Mahmoud, I. Edo, A. Hadi Zadeh, O. Mohamed Awad, J. Albericio, A. Moshovos
53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate : 19%]
A. Delmás, S. Sharify, I. Edo, D. Malone Stuart, O. Mohamed Awad, P. Judd, M. Mahmoud, M. Nikolic, K. Siu, Z. Poulos, and A. Moshovos
52nd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. [Acceptance Rate : 23%]
C. Kison, O. Mohamed Awad, M. Fyrbiak, C. Paar
IEEE Transactions on Information Forensics and Security, 2019
A short summary of my work experience.
- Researching data-efficient training algorithms that provide significant time-to-accuracy savings in pretraining large transformer-based models.
- Working on a data-efficient training library that provides various out-of-the-box algorithms to accelerate the training of arbitrary transformer-based models. Evaluated models include BERT, RoBERTa, Conformer, T5, GPT-3, LLaMA, BLOOM, and ViT.
- Analysis/debugging/tuning of end-to-end performance (starting from model implementation in TensorFlow/PyTorch all the way down to microcode running on chip) of deep learning models on the Cerebras CS-2 Wafer.
- Performance modeling/projection of upcoming models (e.g., Vision Transformer, Linformer) and kernels (e.g., attention) to be supported
- Optimizing the training performance of various state-of-the-art NLP models (BERT, CPM, GPT-2/3) on Huawei’s Ascend910 AI training server.
- Kernel development and performance optimization for Huawei’s Ascend910 AI training server.
- Researching model compression techniques, e.g., low-rank tensor decomposition and layer truncation.
- Researching knowledge distillation techniques to improve the accuracy of compressed models.
- Design of a neural network training accelerator based on a novel processing element architecture that exploits fine-grain unstructured sparsity to increase the performance and energy efficiency of the training process by 1.47× and 1.39×, respectively, on average over the studied models.
- Development of a custom cycle-accurate trace-based simulator (C/C++) to model the execution time and memory access of the proposed accelerator compared to a baseline value-agnostic accelerator.
- Exploiting the narrow floating-point value distribution during training through exponent base-delta encoding compression to save off-chip memory bandwidth by 30% on average.