datawaver

Just a wrapper org for collaboration

Optimized Apache Spark Execution on AWS EMR

Welcome to this organization on GitHub! This organization serves as a central hub for all repositories related to the EMRE project, which focuses on optimizing the execution of Apache Spark jobs on Amazon Web Services (AWS) Elastic MapReduce (EMR). We have adopted a multi-repository strategy to facilitate collaboration, modularity, and maintainability.

TL;DR

EMRE is an approach to deploying and running a Spark cluster that can be almost fully utilized (the working target is around 90%, a figure that is not yet confirmed) and that executes jobs in a predictable time.
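One way to make the utilization target concrete is to compare the core-seconds that Spark tasks actually occupied with the core-seconds the cluster had available. The sketch below is only an illustration of that reading: the `TaskRun` type, the `cluster_utilization` function, and the sample numbers are all hypothetical and are not taken from any EMRE repository.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One Spark task: how many cores it held and for how long (seconds)."""
    cores: int
    duration_s: float


def cluster_utilization(tasks, total_cores, wall_clock_s):
    """Fraction of available core-seconds that tasks actually occupied."""
    if total_cores <= 0 or wall_clock_s <= 0:
        raise ValueError("total_cores and wall_clock_s must be positive")
    used = sum(t.cores * t.duration_s for t in tasks)
    return used / (total_cores * wall_clock_s)


# Hypothetical numbers: three tasks on a 4-core cluster during a 100 s job.
tasks = [TaskRun(2, 100.0), TaskRun(1, 100.0), TaskRun(1, 60.0)]
utilization = cluster_utilization(tasks, total_cores=4, wall_clock_s=100.0)
print(f"{utilization:.0%}")
```

Other definitions (memory-based, or per-instance billing time) are equally plausible; the point is only that the target needs an explicit denominator before it can be measured.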

About EMRE

EMRE is a project focused on optimizing the execution of Apache Spark jobs on Amazon Web Services (AWS) Elastic MapReduce (EMR). It follows a multi-repository strategy: each repository is a self-contained module that captures specific knowledge, best practices, and tools for running Spark jobs. These modules are building blocks that can be used independently, so developers and data engineers can adopt only the components that fit their use cases and requirements. By providing well-defined, reusable modules, EMRE aims to streamline the optimization of Spark workloads on EMR and to promote code reuse, maintainability, and collaboration.

Please be aware that the repositories vary in maturity and completeness, and all of them are currently at an early stage. We believe that sharing these resources early for reference and discussion is more beneficial than waiting until they are "finished." The project as a whole may not suit every use case, but the individual components and techniques should still provide value to the community.

Repositories

This organization hosts a collection of repositories, each focused on a specific aspect of running Apache Spark on AWS EMR. These repositories include:

  • Documentation and guidelines for Spark best practices and optimization techniques on AWS EMR [emre-config]
  • Tools and utilities for Spark job profiling, debugging, and performance analysis [emre-spark]
  • Deployment and setup of Apache Spark jobs [emre-airflow]

Contributing

We encourage contributions from the community, data engineers, and anyone passionate about optimizing Apache Spark on AWS EMR. If you would like to contribute to any of the EMRE repositories, please refer to the specific repository's documentation for contribution guidelines and instructions.

Contact

If you have any questions, suggestions, or feedback regarding the EMRE project or any of its repositories, please don't hesitate to reach out to us through the available communication channels listed in each repository.

Repositories

  • emre-airflow (Public): Use Airflow to create and run Spark Jobs with an EMRE Spark cluster. Python, MIT license, updated Jun 26, 2024.
  • .github (Public): updated Jun 19, 2024.
  • spark-docker (Public): Local Spark cluster setup (automated, reproducible) for debugging and testing behaviour and config. Just, MIT license, updated Jun 18, 2024.
  • emre-config (Public): Configure compute part of EMR clusters. Python, MIT license, updated May 29, 2024.
