Broadly, my interests lie in the areas of data infrastructure, large-scale machine learning, and distributed systems. My current work at LinkedIn involves solving numerous distributed computing challenges using Spark and MapReduce. I am optimizing a library, originally written in Java MapReduce and now being rewritten in Spark, that performs massive statistical aggregations over petabytes of LinkedIn data. The aggregated data is then either pushed to visualization tools, where data scientists can slice and dice it across dimensions, or used to train machine-learning models for LinkedIn's recommender systems. This project has given me hands-on experience in writing, and debugging, scalable and efficient Spark and MapReduce programs. For it, I also wrote numerous Spark utility functions, my favorite being a sampling-based volume estimation algorithm for Spark DataFrames, which predicts a DataFrame's data size with 99% precision before it is materialized to HDFS. Previously, I worked on a MapReduce-based lifecycle-management tool that handles retention for the 6,000+ datasets produced on our platform, using Gobblin/MapReduce jobs that read user-specified retention overrides or apply the default retention policies.
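
The estimator itself is internal to LinkedIn, but its core idea is simple: sample a small fraction of rows, measure their serialized size, and extrapolate by the total row count. Below is a minimal sketch of that idea; the names (`VolumeEstimator`, `estimateBytes`, `sampleFraction`) and the string-based row sizing are my illustrative assumptions, not the actual library's API:

```scala
import org.apache.spark.sql.DataFrame

// Minimal sketch of a sampling-based volume estimator for Spark
// DataFrames. All names here are illustrative assumptions, and the
// UTF-8 string length is a crude stand-in for measuring each row's
// real serialized (e.g. Parquet) footprint.
object VolumeEstimator {
  def estimateBytes(df: DataFrame, sampleFraction: Double = 0.001): Long = {
    val totalRows = df.count()
    if (totalRows == 0L) return 0L

    // Draw a small random sample without replacement.
    val sample = df.sample(withReplacement = false, fraction = sampleFraction)
    val sampledRows = sample.count()
    if (sampledRows == 0L) return 0L

    // Approximate each sampled row's size by its UTF-8 byte length.
    val sampleBytes = sample.rdd
      .map(row => row.mkString(",").getBytes("UTF-8").length.toLong)
      .reduce(_ + _)

    // Extrapolate the average row size to the full DataFrame.
    (sampleBytes.toDouble / sampledRows * totalRows).toLong
  }
}
```

In practice the estimate would be calibrated against the actual on-disk format and its compression, but the sketch captures the sample-and-extrapolate structure that lets the size be predicted before the DataFrame is materialized.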

Apart from this, I am an active interviewer at LinkedIn, conducting interviews for two modules of Algorithms and three modules of Infrastructure Design. I am also a member of the Data Relevance interview-revamp committee, where I create and review new problems on distributed-computing algorithm design (using Spark or MapReduce to solve large-scale data challenges) and machine-learning system design (designing large-scale ML and recommender systems).

My research at CMU was on optimizing the performance of large-scale machine learning systems in which training runs stochastic gradient descent (SGD) in a distributed fashion, using a central parameter server and multiple learner nodes. In such systems, synchronization delays caused by straggling learners can significantly increase the run time of a training job. We developed new algorithms that mitigate straggler effects in synchronous SGD while preserving gradient quality and convergence time, and we also worked on algorithms that reduce the communication overhead of distributed synchronous SGD.
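
One standard way to mitigate stragglers in this setting is for the parameter server to wait for only the first k of n learner gradients each step and treat the rest as backups. The sketch below simulates that k-of-n aggregation; it is my illustrative reconstruction of the generic "backup workers" technique, not the exact algorithm from our work:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

// Sketch of straggler-mitigated synchronous SGD: the parameter server
// averages only the first k of n learner gradients per step, so one
// slow learner cannot stall the whole iteration. Generic illustration,
// not the specific algorithms from our research.
object KOfNSync {
  val dim = 4
  var weights: Array[Double] = Array.fill(dim)(0.0)

  // Simulated learner: returns a gradient after a random delay,
  // which models heterogeneous (straggling) machine speeds.
  def learnerGradient(): Future[Array[Double]] = Future {
    Thread.sleep(Random.nextInt(200))
    Array.fill(dim)(Random.nextGaussian()) // stand-in for a real gradient
  }

  def step(n: Int, k: Int, lr: Double): Unit = {
    val done = new LinkedBlockingQueue[Array[Double]]()
    (1 to n).foreach(_ => learnerGradient().foreach(done.put))

    // Block until the k fastest learners report; ignore the stragglers.
    val firstK = Seq.fill(k)(done.take())
    val avgGrad = firstK.transpose.map(_.sum / k)
    weights = weights.zip(avgGrad).map { case (w, g) => w - lr * g }
  }

  def main(args: Array[String]): Unit = {
    (1 to 10).foreach(_ => step(n = 8, k = 6, lr = 0.1))
    println(weights.mkString(", "))
  }
}
```

The trade-off this exposes is exactly the one our algorithms targeted: waiting for fewer learners cuts synchronization delay, but each step then averages fewer gradients, which affects gradient quality and convergence.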

In Summer 2017, I was a software engineering intern on LinkedIn's Data team, where I wrote a data-tooling library in Scala for translating queries from Hive to Spark. To test the library, I built an optimized workflow, after tuning various Spark configuration parameters, for metrics computation on LinkedIn's Ads data pipeline.
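
To make the translation concrete: the library mapped HiveQL queries onto equivalent Spark DataFrame pipelines. The hand-written example below only illustrates the kind of mapping involved; the table and column names are hypothetical, and the real library generated such code rather than hard-coding it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical example of a Hive-to-Spark translation.
// Source HiveQL (table/column names made up for illustration):
//   SELECT country, COUNT(*) AS cnt
//   FROM ads_events
//   WHERE dt = '2017-06-01'
//   GROUP BY country
object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Equivalent Spark DataFrame pipeline for the query above.
    val result = spark.table("ads_events")
      .filter(col("dt") === "2017-06-01")
      .groupBy("country")
      .agg(count(lit(1)).as("cnt"))

    result.show()
  }
}
```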