Faculty Profiles

Dr. Apan Muhammad Qasem

Professor of Computer Science, College of Science & Engineering

Scholarly and Creative Works

2025

Sadman, Z., & Qasem, A. M. (2025). LLMs in Compiler Optimization: Challenges and Future Direction. IEEE Pulse, 16, 35–37. https://doi.org/10.1109/MPULS.2025.3526517
Vyas, S., Vyas, C., Sharotry, A., Jimenez, J., Qasem, A. M., & Mendez, F. A. (2025). Human Activity Recognition in MMH: Pilot Study on Lifting and Lowering Classification.
Zisan, S. A., Nooruddin, M., & Qasem, A. M. (2025). A Multi-Tiered Autotuner for Portable Heterogeneous-Compute Interfaces (pp. 73–82). https://doi.org/10.1109/COMPSAC65507.2025.00018
Hanz, T. R., Qasem, A. M., Ali, M., & Sadman, Z. (2025). Autotuning CNN Workloads on the Edge: A Hybrid Approach with Cross-Domain Embeddings (pp. 421–430). https://doi.org/10.1109/COMPSAC65507.2025.00062
Ali, M., Sadman, Z., & Qasem, A. M. (2025). Improving Energy Efficiency of Irregular Workloads with Transformers and Tabular Data Diffusion (pp. 2408–2409). https://doi.org/10.1109/COMPSAC65507.2025.00339
Hanz, T. R., & Qasem, A. M. (2025). Accelerated Autotuning of Deep Learning Workloads with Pretrained Performance Models (pp. 337–344). https://doi.org/10.1109/HPCC67675.2025.00063
Tomasso, M. E., & Qasem, A. M. (2025). Time-Series Analysis of Agent-Based Models: Three Case Studies.
Tomasso, M. E., & Qasem, A. M. (2025). An Ensemble Approach to Creating Surrogates for Macro-Emulation of Time-Series ABMs.

2024

Ali, M., & Qasem, A. M. (2024). Alleviating dataset constraints through synthetic data generation in machine learning driven power modeling (pp. 52–58).

2023

Novoa Ramirez, C. M., & Qasem, A. M. (2023). GPU-accelerated Parallel Solutions to the Quadratic Assignment Problem. ArXiv, 1–25. Retrieved from https://arxiv.org/abs/2307.11248

2022

Bunde, D., Ahmed, K., Ayloo, S., Brown-Gaines, T., Fuentes, J., Jatala, V., … Yeh, T. (2022). Adopting Heterogeneous Computing Modules: Experiences from a ToUCH Summer Workshop. IEEE.
Qasem, A. M., Ayguade, E., Cahill, K., Ostasz, M., Panda, D. K., & Tomko, K. (2022). Lightning Talks of EduHPC 2022. IEEE.
Rafi, M. E. H., Williams, K. B., & Qasem, A. (2022). Raptor: Detecting CPU-GPU False Sharing Under Unified Memory Systems. https://doi.org/10.1109/IGSC55832.2022.9969376
Rafi, M. E. H., & Qasem, A. M. (2022). Optimal Launch Bound Selection in CPU-GPU Hybrid Graph Applications with Deep Learning. https://doi.org/10.1109/IGSC55832.2022.9969364
Girolamo, J. D., Hope, J., & Qasem, A. (2022). Uncovering Input-Sensitive Energy Bottlenecks in Oversubscribed GPU Workloads. Sustainable Computing: Informatics and Systems, 35. https://doi.org/10.1016/j.suscom.2022.100654
Qasem, A. M. (2022). YODA: A pedagogical tool for teaching systems concepts (Vol. 1, pp. 613–618).
Qasem, A. M., & Bunde, D. (2022). Heterogeneous computing for undergraduates: Introducing the ToUCH module repository (Vol. 2).

2021

Hope, J., Gjergji, M., DiGirolamo, J., Alvarez, M., & Qasem, A. (2021). Characterizing Input-sensitivity in Tightly-Coupled Collaborative Graph Algorithms (pp. 287–296). https://doi.org/10.1109/CCGrid51090.2021.00038
Ford, B., Qasem, A., Tesic, J., & Zong, Z. (2021). Migrating Software from x86 to ARM Architecture: An Instruction Prediction Approach.
Ford, B. W., Qasem, A. M., Tesic, J., & Zong, Z. (2021). Migrating Software from x86 to ARM Architecture: An Instruction Prediction Approach. In 2021 IEEE International Conference on Networking, Architecture and Storage (NAS) (pp. 1–6). IEEE. https://doi.org/10.1109/nas51552.2021.9605443
Bunde, D., Schielke, P., & Qasem, A. M. (2021). Short Modules for Introducing Heterogeneous Computing. Journal of Computing Sciences in Colleges, 36(8), 95–96. https://doi.org/10.5555/3470135.3470145
Bunde, D. P., Qasem, A., & Schielke, P. (2021). Short Modules for Introducing Heterogeneous Computing: Workshop.
Bunde, D., Schielke, P., & Qasem, A. M. (2021). Teaching About Heterogeneity (SIGCSE21).
Qasem, A. M., Bunde, D. P., & Schielke, P. (2021). A Module-based Introduction to Heterogeneous Computing in Core Courses. Journal of Parallel and Distributed Computing, 158, 56–66. https://doi.org/10.1016/j.jpdc.2021.07.011

2020

Sultana, T., Allen, B., & Qasem, A. M. (2020). Intelligent Data Placement on Discrete GPU Nodes with Unified Memory (PACT20) (pp. 139–151). New York, NY: ACM. https://doi.org/10.1145/3410463.3414651

2019

Qasem, A. (2019). A Gentle Introduction to Heterogeneous Computing for CS1 Students (pp. 10–16). IEEE. https://doi.org/10.1109/EduHPC49559.2019.00007
Hope, J., Nag, T., & Qasem, A. (2019). Energy-efficient GPU graph processing with on-demand page migration. IEEE Computer Society.
Aslan, S., Kellington, J. W., Sefat, M. S., & Qasem, A. M. (2019). Accelerating HotSpots in Deep Neural Networks on a CAPI-Based FPGA (pp. 248–256). https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00048

2018

Qasem, A. M., Novoa Ramirez, C. M., **Kolla, C. S., & *Coyle, S. (2018). High-Accuracy Scalable Solutions to the Dynamic Facility Layout Problem. Super Computing (SC18) - The International Conference for High Performance Computing, Networking, Storage, and Analysis Proceedings, 1–2. Retrieved from https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post169s2-file3.pdf
Qasem, A. M. (2018). Modules for Teaching Parallel Performance Concepts. In Topics in Parallel and Distributed Computing Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms (pp. 59–77). Springer. https://doi.org/10.1007/978-3-319-93109-8
Sefat, M. S., Aslan, S., & Qasem, A. M. (2018). Hardware Acceleration of CNNs with Coherent FPGAs.
Qasem, A. M. (2018). Modules for Teaching Parallel Performance Concepts. In Topics in Parallel and Distributed Computing: Introducing Concurrency in Undergraduate Courses (Vol. 2). Springer.
Qasem, A. M., Aji, A. M., & Chu, M. L. (2018). Investigating data layout transformations in Chapel (pp. 915–924).

2017

Saha, B. K., Rahman, S., Connors, T., & Qasem, A. M. (2017). A Machine Learning Approach to Automatic Creation of Architecture-sensitive Performance Heuristics.
Connors, T., & Qasem, A. M. (2017). Automatically Selecting Profitable Thread Block Sizes for Accelerated Kernels.
Qasem, A. M., & Teich, S. (2017). Evaluating the Impact of Data Layout and Placement on the Energy Efficiency of Heterogeneous Applications.
Teich, S., & Qasem, A. M. (2017). Mitigating Register Pressure in GPU Kernels for Improved Energy Efficiency.
Qasem, A. M., Aji, A., & Rodgers, G. (2017). Characterizing data organization effects on heterogeneous memory architectures.

2015

Qasem, A. M., Burtscher, M., & Taheri, S. (2015). A tool for automatically suggesting source-code optimizations for complex GPU kernels.
Qasem, A. M., Rahman, S., Burtscher, M., & Zong, Z. (2015). Maximizing hardware prefetch effectiveness with machine learning.
Qasem, A. M., Novoa, C., & Chaparala, A. (2015). A SIMD tabu search implementation for solving the quadratic assignment problem with GPU acceleration.
Qasem, A. M., Burtscher, M., Peng, W., Shi, H., Tamir, D., & Thiry, H. (2015). A module-based approach to adopting the 2013 ACM curricular recommendations on parallel computing.
Novoa Ramirez, C. M., Qasem, A. M., & **Chaparala, A. (2015). A SIMD Tabu Search Implementation for Solving the Quadratic Assignment Problem with GPU Acceleration. In Proceedings of the 2015 Annual Conference on Extreme Science and Engineering Discovery Environment (XSEDE ’15) (pp. 1–8). https://doi.org/10.1145/2792745.2792758
Qasem, A. M., Novoa Ramirez, C. M., **Chaparala, A., & *Rishu, N. (2015). Autotuning GPU-accelerated Quadratic Assignment Problems (QAP) Solvers for Power and Performance. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications (HPCC) (pp. 1–6).
Qasem, A. M., Gutierrez, M., Rahman, S., & Tamir, D. (2015). Neural network methods for fast and portable prediction of CPU power consumption.
Qasem, A. M., Alvarado, C., & Tamir, D. (2015). Realizing energy-efficient thread affinity configurations with supervised learning.
Qasem, A. M., & Connors, T. (2015). Power-performance analysis of metaheuristic search algorithms on the GPU.
Qasem, A. M., Saha, B. K., & Rahman, S. (2015). MLTUNE: A tool-chain for automating the workflow of machine-learning based performance tuning (extended abstract).
Qasem, A. M., & Rahman, S. (2015). Investigating prefetch potential on the Xeon Phi with autotuning (extended abstract).
Qasem, A. M., Gutierrez, M., & Tamir, D. (2015). Evaluating neural network methods for PMC-based CPU power prediction.
Qasem, A. M., Alvarado, C., & Tamir, D. (2015). Energy-efficient thread migration via dynamic characterization of resource utilization.
Qasem, A. M., Chaparala, A., & Novoa, C. (2015). Autotuning GPU-accelerated QAP solvers for power and performance.

2014

Qasem, A. M. (2014). Exposing undergraduates to parallel performance concepts with a three-module sequence.
Qasem, A. M., Shankar, S., Lakomski, G., Alvarado, C., Hay, R., Hyatt, C., & Tamir, D. (2014). Power aware work stealing in homogeneous multicore systems.
Qasem, A. M., Hyatt, C., Lakomski, G., Alvarado, C., Hay, R., & Tamir, D. (2014). Power aware task matching and migration in heterogeneous processing environments.
Qasem, A. M., Alvarado, C., & Tamir, D. (2014). Dynamic feedback-driven thread migration for energy-efficient execution of multithreaded workloads.
Qasem, A. M., Novoa, C., & Chaparala, A. (2014). A SIMD solution for the quadratic assignment problem with GPU acceleration.
**Chaparala, A., Novoa Ramirez, C. M., & Qasem, A. M. (2014). A SIMD Solution for the Quadratic Assignment Problem with GPU Acceleration. In Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment (XSEDE ’14) (pp. 1–8). https://doi.org/10.1145/2616498.2616521

2013

Qasem, A. M., & Magee, J. (2013). Improving TLB performance on current chip multiprocessor architectures through demand-driven Superpaging. Software Practice and Experience (SPE), 43(6), 705–729.
Qasem, A. M., Burtscher, M., Peng, W., Shi, H., Tamir, D., & Thiry, H. (2013). Integrating parallel computing into the undergraduate curriculum at Texas State University: Experiences from the first year.
Qasem, A. M., Hyatt, C., Lakomski, G., & Tamir, D. (2013). Power aware task matching and migration in heterogeneous processing environments.
Qasem, A. M., Holt, J., Bazzera, G., Miller, J., & Hoffman, H. (2013). A pattern language for adaptive parallel software.
Qasem, A. M., Shankar, S., & Tamir, D. (2013). Towards an operating system based framework for energy-efficient scheduling of parallel workloads.
Qasem, A. M., Rashid, H., Hay, R., & Novoa, C. (2013). Algorithmic choice in optimization problems: A performance study (extended poster abstract).
Qasem, A. M., Rahman, S., & Hay, R. (2013). Enhancing learning-based autotuning with composite and diagnostic feature vectors (extended poster abstract).
Qasem, A. M., Burtscher, M., Peng, W., Shi, H., & Tamir, D. (2013). Preparing computer science students for an increasingly parallel world: Teaching parallel computing early and often (extended poster abstract).
**Rashid, H., Novoa Ramirez, C. M., Hay, R., & Qasem, A. M. (2013). Algorithmic Choice in Optimization Problems: A Performance Study. Super Computing (SC13) - The International Conference for High Performance Computing, Networking, Storage, and Analysis Proceedings, 1. Retrieved from http://sc13.supercomputing.org/sites/default/files/PostersArchive/tech_posters/post133s2-file3.pdf

2012

Qasem, A. M., & Sarangkar, S. (2012). MATS: A Model-driven Adaptive Tuning System for Parallel Workloads. Journal of Parallel and Cloud Computing (JPCC), 1, 50–64.
Qasem, A. M. (2012). High-Level Language Extensions For Fast Execution Of Pipeline-Parallelized Code On Current Chip Multi-Processor Systems. International Journal of Programming Languages and Applications (IJPLA), 2, 1–12.
Qasem, A. M. (2012). Architectural Considerations for Compiler-guided Unroll-and-Jam of CUDA Kernels. American Journal of Computer Architecture, 1, 12–20.
Qasem, A. M. (2012). Autotuning Strategies For Reducing Synchronization Costs In Multithreaded Kernels. Journal of Systems and Software (JSYS), 2, 152–165.
**Rashid, H., Novoa Ramirez, C. M., McKenney, M., & Qasem, A. M. (2012). Efficient Parallel Solutions to the Integral Knapsack Problem on Current Chip-multiprocessor Systems. International Journal of Parallel, Emergent and Distributed Systems, 27(1), 19–44.
Qasem, A. M., Cade, M. J., & Tamir, D. (2012). Improved Energy Efficiency For Multithreaded Kernels Through Model-Based Autotuning (pp. 1–6).
Qasem, A. M., Unkule, S., & Shaltz, C. (2012). Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality (pp. 21–40).
Qasem, A. M. (2012). Efficient Execution of Time-step Computations with Pipelined Parallelism and Inter-thread Data Locality Optimizations (pp. 27–35).
Qasem, A. M., & Tamir, D. (2012). Memory Performance Diagnosis Through Feedback Synthesis (pp. 5–10).
Qasem, A. M., & Chen, S. (2012). Using Macro Features in Learning Algorithms for Optimizing Dense-matrix Computations (Vol. Technical Report CS-TR-2012-21). Department of Computer Science, Texas State University.

2011

Qasem, A. M., Rashid, H., Novoa, C., & McKenney, M. (2011). Efficient Parallel Solutions to the Integral Knapsack Problem on Current Chip-multiprocessor Systems. International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), 27, 19–44.
Qasem, A. M., & Unkule, S. (2011). Register Pressure Aware Code Transformations On GPU (Extended Abstract), 19–20.
Qasem, A. M., Rahman, F., & Yi, Q. (2011). Understanding Stencil Code Performance On Multicore Architecture (pp. 30–45).
Qasem, A. M., Novoa, C., Rashid, H., & McKenney, M. (2011). Dynamic Programming Solutions for the Integral Knapsack Problem on Multicore Architectures (Extended Abstract).

2010

**Rashid, H., Novoa Ramirez, C. M., & Qasem, A. M. (2010). An Evaluation of Parallel Knapsack Algorithms on Multicore Architectures. In Proceedings of the 2010 International Conference on Super Computing (CSC’10) (pp. 230–235).
Qasem, A. M., & Sarangkar, S. (2010). Intelligent Feedback For Fast and Effective Autotuning (Extended Abstract).
Qasem, A. M., Yi, Q., & Guo, J. (2010). Evaluating the Role of Optimization-Specific Search Heuristics in Effective Autotuning (short paper).
Qasem, A. M., Guo, J., Rahman, F., & Yi, Q. (2010). Exposing Tunable Parameters in Multithreaded Numerical Code (pp. 46–60).
Qasem, A. M. (2010). Locality-Conscious Superpaging for Improved TLB Behavior of Stencil Computations.
Qasem, A. M., Yi, Q., & Sarangkar, S. (2010). Improving Autotuning Efficiency And Portability Through Feedback Diagnostics.
Qasem, A. M., Rashid, H., & Novoa, C. (2010). An Evaluation Of Parallel Knapsack Algorithms On Multicore Architectures (pp. 230–235).
Qasem, A. M., & Sarangkar, S. (2010). Restructuring Parallel Loops to Curb False Sharing on Multicore Architectures (pp. 1–7).

2009

Qasem, A. M., & Cade, J. (2009). Balancing Data Locality And Parallelism on Shared-cache Multi-Core Systems (pp. 188–195).
Qasem, A. M., & Magee, J. (2009). A Case for Compiler-driven Superpage Allocation.

2008

Qasem, A. M., & Kennedy, K. (2008). Model-guided Empirical Tuning of Loop Fusion. International Journal of High Performance Systems Architecture (IJHPSA), 1, 183–198.
Qasem, A. M., & Yi, Q. (2008). Exploring the optimization space of dense linear algebra kernels (pp. 343–355).
Qasem, A. M. (2008). Evaluating an Early Stop Criterion and a Statistical Pruning Strategy of the Optimization Search Space (pp. 506–510).

2007

Qasem, A. M., & Kennedy, K. (2007). Pruning the Optimization Search Space Using Architecture-aware Cost Models.

2006

Qasem, A. M., Kennedy, K., & Mellor-Crummey, J. (2006). Automatic Tuning of Whole Applications Using Direct Search and a Performance-based Transformation System. The Journal of Supercomputing, 36, 183–196.
Qasem, A. M., & Kennedy, K. (2006). Profitable Loop Fusion and Tiling Using Model-driven Empirical Search (pp. 249–258).

2005

Qasem, A. M., & Kennedy, K. (2005). A Cache-conscious Profitability Model for Empirical Tuning of Loop Fusion (pp. 106–120).
Qasem, A. M., & Kennedy, K. (2005). Evaluating a Model for Cache Conflict Miss Prediction (Vol. Technical Report CS-TR05-457). Department of Computer Science, Rice University.

2004

Qasem, A. M., Kennedy, K., & Mellor-Crummey, J. (2004). Automatic Tuning of Whole Applications Using Direct Search and a Performance-based Transformation System.

2003

Qasem, A. M., Jin, G., & Mellor-Crummey, J. (2003). Improving Performance with Integrated Program Transformations (Vol. Technical Report CS-TR03-419). Department of Computer Science, Rice University.

2002

Qasem, A. M., Fowler, R., Mellor-Crummey, J., & Jin, G. (2002). A Source-to-source Loop Transformation Tool. Extended poster abstract.

2001

Qasem, A. M., Whalley, D., Yuan, X., & van Engelen, R. (2001). Using a Swap Instruction to Coalesce Loads and Stores (pp. 235–240).
Qasem, A. M., & Whalley, D. (2001). Using a Swap Instruction to Coalesce Loads and Stores (Vol. TR-010403). Department of Computer Science, Florida State University.