datawaver

Optimized Apache Spark Execution on AWS EMR

Welcome to this organization on GitHub! This organization serves as a central hub for all repositories related to the EMRE project, which focuses on optimizing the execution of Apache Spark jobs on Amazon Web Services (AWS) Elastic MapReduce (EMR). We have adopted a multi-repository strategy to facilitate collaboration, modularity, and maintainability.

TL;DR

EMRE is an approach to deploying and running a Spark cluster that can be almost fully utilized (the target is roughly 90%, a figure still to be confirmed) and that executes jobs in a predictable time.

About EMRE

EMRE focuses on optimizing the execution of Apache Spark jobs on AWS EMR. Each repository is a self-contained module that encapsulates specific knowledge, best practices, and tools for running Spark jobs. The modules are building blocks that can be used independently, so developers and data engineers can adopt only the components that fit their use cases and requirements. By providing well-defined, reusable modules, EMRE aims to streamline the optimization of Spark workloads on EMR and to promote code reuse, maintainability, and collaboration.

Please be aware that the repositories vary in maturity and completeness, and all are currently at an early stage. We believe that sharing these resources early for reference and discussion is more beneficial than waiting until they are "finished." The project idea as a whole may not suit every use case, but the individual components and techniques should still provide value to the community.

Repositories

This organization hosts a collection of repositories, each focused on a specific aspect of running Apache Spark on AWS EMR. These repositories include:

Documentation and guidelines for Spark best practices and optimization techniques on AWS EMR [emre-config]
Tools and utilities for Spark job profiling, debugging, and performance analysis [emre-spark]
Deployment and setup of Apache Spark jobs [emre-airflow]

Contributing

We encourage contributions from the community, data engineers, and anyone passionate about optimizing Apache Spark on AWS EMR. If you would like to contribute to any of the EMRE repositories, please refer to the specific repository's documentation for contribution guidelines and instructions.

Contact

If you have any questions, suggestions, or feedback regarding the EMRE project or any of its repositories, please don't hesitate to reach out to us through the available communication channels listed in each repository.
