Omar Mohamed Awad

Portfolio

Personal Projects

Here you can see some of the open-source projects I've done on my own time.

Compressed-Memory Sparse DNN Inference Accelerator on GPU

Improving ResNet-9 Generalization Trained on Small Datasets

Winner of the “Hardware Aware Efficient Training” competition at ICLR 2021

5-Stage Pipelined MIPS Processor

Implementation of a MIPS processor with AES encrypted memory in HLS

In my free time, I continue to work on personal projects and have many ideas just waiting to be realized.

To see more of my projects...

Visit My GitHub

Awards

Winner of the Data Application Acceleration Lab’s Individual Award of Q1, 2023. [May 2023]

Winner of the “Hardware Aware Efficient Training” competition at ICLR 2021. [May 2021]

Winner of Huawei Quarterly Outstanding Contribution to Project Award. [October 2020]

University of Toronto Edward S. Rogers Sr. Graduate Scholarship for 2 years. [2019 & 2020]

Ruhr University Bochum Undergraduate Research Award for 1 year. [2017]

German University in Cairo High School Excellence Scholarship for 5 years. [2013-2018]

Papers

Master's thesis:

FPRaker: Exploiting Fine-Grain Sparsity to Accelerate Neural Network Training

Omar Mohamed Awad. 2019

Bitpruning: Learning bitlengths for aggressive and accurate quantization

M. Nikolić, G. Hacene, C. Bannon, A. Lascorz, M. Courbariaux, O. Mohamed Awad, I. Vivancos, Y. Bengio, V. Gripon, A. Moshovos

2024 IEEE International Symposium on Circuits and Systems (ISCAS)

SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection

F. Ataiefard, W. Ahmed, H. Hajimolahoseini, S. Asani, F. Javadi, M. Hassanpour, O. Mohamed Awad, A. Wen, K. Liu, Y. Liu

Association for the Advancement of Artificial Intelligence (AAAI 2024)

SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling

H. Hajimolahoseini, O. Mohamed Awad, W. Ahmed, A. Wen, S. Asani, M. Hassanpour, F. Javadi, M. Ahmadi, F. Ataiefard, K. Liu, Y. Liu

37th Conference on Neural Information Processing Systems (NeurIPS 2023)

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

F. Javadi, W. Ahmed, H. Hajimolahoseini, F. Ataiefard, M. Hassanpour, S. Asani, A. Wen, O. Mohamed Awad, K. Liu, Y. Liu

37th Conference on Neural Information Processing Systems (NeurIPS 2023)

cuSCNN: an Efficient CUDA Implementation of Sparse CNNs

M. Elgammal, O. Mohamed Awad, I. Edo, A. Moshovos, V. Betz

HEART ’23: Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies

Compressing Pre-trained Language Models using Progressive Low Rank Decomposition

H. Hajimolahoseini, M. Rezagholizadeh, V. Partovinia, M. Tahaei, O. Mohamed Awad, Y. Liu

35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks · Dec 6, 2021

Improving ResNet-9 Generalization Trained on Small Datasets

O. Mohamed Awad, H. Hajimolahoseini, M. Lim, G. Gosal, W. Ahmed, Y. Liu, G. Deng

Hardware Aware Efficient Training (HAET) at ICLR 2021
(Winning paper of the Hardware Aware Efficient Training competition at ICLR 2021)

FPRaker: A Processing Element for Accelerating Neural Network Training

O. Mohamed Awad, M. Mahmoud, I. Edo, A. Hadi Zadeh, C. Bannon, A. Moshovos

54th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021. [Acceptance Rate: 21%]

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

A. Hadi Zadeh, I. Edo, O. Mohamed Awad, A. Moshovos

53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate: 19%]

TensorDash: Exploiting Sparsity to Accelerate Neural Network Training

M. Mahmoud, I. Edo, A. Hadi Zadeh, O. Mohamed Awad, J. Albericio, A. Moshovos

53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. [Acceptance Rate: 19%]

ShapeShifter: Enabling Fine-Grain Data Width Adaptation in Deep Learning

A. Delmás, S. Sharify, I. Edo, D. Malone Stuart, O. Mohamed Awad, P. Judd, M. Mahmoud, M. Nikolic, K. Siu, Z. Poulos, and A. Moshovos

52nd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. [Acceptance Rate: 23%]

Security Implications of Intentional Capacitive Crosstalk

C. Kison, O. Mohamed Awad, M. Fyrbiak, C. Paar

IEEE Transactions on Information Forensics and Security, 2019

Timeline

A short summary of my work experience.

Oct 2022 - Present

Senior ML Research Engineer

Huawei Technologies Canada Co., Ltd., Toronto, Canada

- Mentoring new hires in the team during their 3-month ramp-up program (setting goals, technical and non-technical mentorship, reviewing and evaluating progress).
- Interviewing candidates for our internal SW/HW co-design project (exploring the interplay between ML architectures/training algorithms and HW to influence the next release of Huawei's custom ML accelerator).
- Designing efficient Transformer + SSM Hybrid LLMs.
- Leading a SW/HW co-design project on sub-8-bit training and inference of LLMs.
- Training Memory-augmented LLMs to support long/unlimited context.
- Researching data-efficient training algorithms that provide significant time-to-accuracy savings in pretraining LLMs.
- Working on a data-efficient training library that provides various out-of-the-box algorithms to accelerate the training of any arbitrary LLM.
- Main contributor of 2 internal patents (one for a novel dataset sampling method to accelerate model training, and another for a novel method to accelerate MoE transformer models).
- Winner of the Data Application Acceleration Lab's Individual Award of Q1, 2023.

Aug 2021 - Oct 2022

Member of Technical Staff - Deep Learning Performance

Cerebras Systems, Toronto, Canada

- Analysis/debugging/tuning of end-to-end performance (starting from model implementation in TensorFlow/PyTorch all the way down to microcode running on chip) of deep learning models on the Cerebras CS-2 Wafer.
- Performance modeling/projection of upcoming models (e.g., Vision Transformer, Linformer) and kernels (e.g., attention) to be supported.

Aug 2020 - Aug 2021

Machine Learning Research Engineer

Huawei Technologies Canada Co., Ltd., Toronto, Canada

- Optimizing the training performance of various state-of-the-art NLP models (BERT, CPM, GPT-2/3) on Huawei’s Ascend910 AI training server.
- Kernel development and performance optimization for Huawei’s Ascend910 AI training server.
- Researching model compression techniques, e.g., low-rank tensor decomposition and layer truncation.
- Researching knowledge distillation techniques to improve accuracy of compressed models.

Sep 2018 - Jul 2020

Graduate Research Assistant

University of Toronto, Toronto, Canada

- Design of a neural network training accelerator based on a novel processing element architecture that exploits fine-grain unstructured sparsity to increase the performance and energy efficiency of the training process by 1.47× and 1.39×, respectively, on average over the studied models.
- Development of a custom cycle-accurate trace-based simulator (C/C++) to model the execution time and memory access of the proposed accelerator compared to a baseline value-agnostic accelerator.
- Exploiting the narrow floating-point value distribution during training through exponent base-delta encoding compression to save off-chip memory bandwidth by 30% on average.

Contact Me

Email: omar.mo.awad@outlook.com

or send me a message over LinkedIn!

Compiler/Transpiler

Parses and compiles a custom programming language into C++.

I've created my own C-based scripting language. While it strongly resembles other languages, it combines features that no single one of them offers on its own. It includes Java's static type checking, C++'s primitive types, Python's binary operations and nested functions, and JavaScript's expression evaluation and objects.

It is designed to overcome some of the more annoying aspects of C++, such as its inability to be used as a scripting language and to resolve functions defined later in the file or nested within other functions. Simultaneously, it retains many of the great features of C++ as well as features from other languages like Python and JavaScript, all while keeping the speed C++ is known for. This is possible because the compiler uses ANTLR and a context-free grammar to parse the code and then transpiles it directly into C++.

I'm still working on this project and looking to add more features. Below is a brief example of the language which demonstrates some of its features so far, followed by the abstract syntax tree generated by the parser.


print(test(0));                           // Use as scripting language

unsigned short test(int n) {              // unsigned short return type
    long double x = 2;
    if (n) {                              // if n != 0
        return (unsigned short) x ** n;   // x to the power of n
    }
    return tryAgain();

    unsigned short tryAgain() {           // Nested function
        return test(n + 1);
    }
}
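
The compiler's actual C++ output is not shown on this page, so the following is only a hand-written sketch of what an equivalent C++ translation of the example above could look like. It assumes three lowering choices that are guesses rather than the project's documented behavior: a forward declaration so that test can be called before it is defined, a lambda standing in for the nested function tryAgain, and std::pow standing in for the ** operator.

```cpp
#include <cmath>

// Assumption: a forward declaration is emitted so that calls may appear
// before the function body, mirroring the "defined later in the file" feature.
unsigned short test(int n);

unsigned short test(int n) {
    long double x = 2;

    // Assumption: the nested function becomes a lambda capturing n by value.
    // C++ requires it to be defined before use, so it is hoisted to the top.
    auto tryAgain = [n]() -> unsigned short {
        return test(n + 1);
    };

    if (n) {  // if n != 0
        // Assumption: x ** n is lowered to std::pow(x, n).
        return static_cast<unsigned short>(std::pow(x, n));
    }
    return tryAgain();
}
```

In this sketch the top-level statement print(test(0)) would correspond to a call placed inside main; test(0) falls through to tryAgain, which calls test(1) and yields 2.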

Below are two more examples: Binary search and the Ackermann function.


// Binary Search
int binarySearch(int[] arr, int value, int left, int right) {
    while (left <= right) {
        int middle = (left + right) / 2;
        if (arr[middle] == value)
            return middle;
        else if (arr[middle] > value)
            right = middle - 1;
        else
            left = middle + 1;
    }
    return -1;
}


// Ackermann Function
int ackermann(int m, int n) {
    if (m == 0) return n + 1;
    if (n == 0) return ackermann(m - 1, 1);

    return ackermann(m - 1, ackermann(m, n - 1));
}

View this project on GitHub: https://github.com/ewadkins/Compiler


Learn more about my:

Skills

Deep Learning

TensorFlow/PyTorch

Python

C/C++

GPU programming (CUDA)

Perl, TCL, Bash

Verilog/SystemVerilog/VHDL

DevOps Tools (Git, Docker, Conda, JIRA)

ASIC Design
