The role of an Amazon Data Engineer
Much like other top companies, Amazon treats data engineering as one of its most critical hires and is expanding these roles dramatically. Data engineering involves the collection and validation of data that helps the company meet its objectives. Data engineers face unique challenges in deciding what data must be selected, processed, and shaped, and doing this competently makes it one of the most challenging jobs out there.
Data engineers work alongside product managers, designers, data scientists, and software engineers, and are an integral part of the team. They are responsible for extracting data and transforming it through pipelines that the rest of the team builds upon. In simple words, a data engineer manages the data and the data scientists explore it.
Some qualities of a Data Engineer at Amazon:
A good command of SQL, database systems, and other programming languages is essential for working on complex datasets.
In-depth knowledge of and experience with big data processing frameworks such as Hadoop or Apache Spark, and analytical environments.
Additionally, Amazon operates by its 14 Leadership Principles and looks for people who live these principles every day.
Amazon Data Engineer preparation
1. Data warehouse Concepts
Make sure you have a solid fundamental understanding of data warehouse concepts, the purpose of building a system for BI and analytics, and how OLTP and OLAP differ in terms of storage, processing, and data.
2. Data Modeling
A deeper understanding of normalized design (Third Normal Form) and denormalized design (dimensional modeling with Star and Snowflake schemas) is very useful.
Also important are data warehouse modeling techniques and a strategic understanding of how to translate requirements into appropriate data model structures, including business, logical, and physical models.
3. Database concepts and performance tuning
The ability to tune a complex query on large datasets using execution plans, with an understanding of join mechanisms and memory/disk I/O considerations. Using parallelism and knowing optimal data archiving and purging strategies will also come in handy.
4. ETL
Building and maintaining ETL data pipelines: source-to-target mappings, aggregations, managing data transformations, and so on.
5. SQL and Reporting Concepts
Good SQL skills and experience developing dashboards and scorecards with tools such as Tableau and Power BI.
Amazon Data Engineer Interview Sample Questions
1. SQL
DML: SQL querying, query optimization, JOINs (including self-joins with complex ON conditions), window functions (and an understanding of how to emulate them with self-joins), and debugging a slow-running SQL query (a small runnable sketch follows below).
DDL: adding indexes, creating/altering tables; DML writes such as INSERT/UPDATE.
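To make the window-function expectation concrete, here is a minimal sketch using Python's built-in sqlite3 module; the orders table and its columns are hypothetical, and window functions require SQLite 3.25 or newer:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01", 50.0), (1, "2024-01-05", 75.0), (2, "2024-01-02", 20.0)],
)
# Rank each customer's orders by amount with a window function
rows = conn.execute(
    "SELECT customer_id, order_date, amount, "
    "RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank "
    "FROM orders"
).fetchall()
print(rows)
In an interview, be ready to explain how the same ranking could be emulated with a self-join when window functions are unavailable.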
2. Database/data warehouse general concepts and knowledge
Data Modeling: the ability to design a schema for a given scenario (e.g., a rideshare or food delivery service), build queries to access data in the model, database normalization, and star schema (vs. snowflake).
3. ETL
Batching: all data is provided in chunks; scheduling daily/hourly jobs; experience with job schedulers such as Airflow, Luigi, or Azkaban.
Streaming: data arrives as it is generated; you need to maintain state in various forms, know how to aggregate new data against that state in SQL (e.g., new data vs. a historical table), and handle user session stitching (a short sketch follows below).
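As an illustration of keeping state across streaming micro-batches, here is a minimal pure-Python sketch; the event format and field names are made up for the example:
from collections import defaultdict

# Historical state: per-user event counts and last-seen timestamps
state = defaultdict(lambda: {"events": 0, "last_seen": None})

def process_batch(batch, state):
    # Merge a new micro-batch of (user_id, timestamp) events into the historical state
    for user_id, ts in batch:
        entry = state[user_id]
        entry["events"] += 1
        entry["last_seen"] = ts if entry["last_seen"] is None else max(entry["last_seen"], ts)
    return state

process_batch([("u1", "2024-01-01T10:00"), ("u2", "2024-01-01T10:05")], state)
process_batch([("u1", "2024-01-01T10:20")], state)
print(dict(state))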
4. Python
Main Python data structures (base types, lists, dictionaries)
Function creation / handling
Decorators
5. Product Sense
Define different product metrics, such as user engagement or day-over-day changes, and write SQL queries or Python/Spark scripts to populate those metrics (a short pandas sketch follows).
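For instance, a pandas sketch of a daily-active-users metric with its day-over-day change might look like this; the event log and column names are hypothetical:
import pandas as pd

# Hypothetical event log: one row per user action
events = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 1],
    "event_date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02",
         "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
})

# Daily active users (DAU) and the day-over-day percentage change
dau = events.groupby("event_date")["user_id"].nunique().rename("dau")
dod_change = dau.pct_change().rename("dod_change")
print(pd.concat([dau, dod_change], axis=1))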
6. Leadership/Behavioral Questions
How have you handled conflict?
What’s a decision you made you were proud of?
Data Engineer Manager Interview Questions
For engineering manager positions, the questions relate to decision making, business understanding, curating and maintaining datasets, and compliance and security policies.
What is the difference between a data warehouse and an operational database?
A data warehouse serves historical data for analytics and decision making. It supports high-volume analytical processing (OLAP). Data warehouses are designed to handle large, complex queries that access many rows; they serve relatively few concurrent users and are optimized for fast retrieval of large volumes of data. Operational database management systems manage dynamic datasets in real time. They support high-volume transaction processing for thousands of concurrent clients, and the data usually consists of day-to-day information about business operations.
Why do you think every firm using data systems requires a disaster recovery plan?
Disaster management is a crucial part of a data engineering manager's job. The manager plans and prepares disaster recovery for the various data storage systems. This involves backing up files and media in real time; the backup storage is used to restore the files in case of a cyber-attack or equipment failure. Security protocols are put in place to monitor, trace, and restrict both incoming and outgoing traffic.
Data Engineer Technical Interview Questions
Data Engineering Tools
What is data orchestration, and what tools can you use to perform it?
Data orchestration is the automated process of accessing raw data from multiple sources, cleaning, transforming, and modeling it, and serving it for analytical tasks. The most popular tools are Apache Airflow, Prefect, Dagster, and AWS Glue.
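As a rough illustration of orchestration, here is a minimal Airflow DAG sketch; it assumes Apache Airflow 2.x is installed, and the pipeline name and task logic are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull raw data from a source system (placeholder)
    pass

def transform():
    # Clean and model the extracted data (placeholder)
    pass

with DAG(
    dag_id="daily_sales_pipeline",       # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task       # run extract before transform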
What tools do you use for analytics engineering?
Analytics engineering is the process of taking processed data, transforming it, applying statistical modeling, and presenting it in the form of reports and dashboards. Popular tools are dbt (data build tool), BigQuery, Postgres, Metabase, Google Data Studio, and Tableau.
Python Interview Questions for Data Engineers
Which Python libraries are most efficient for data processing?
The most popular libraries for data processing are Pandas and NumPy. For parallel processing of large datasets, we use Dask, PySpark, Datatable, and RAPIDS. They all have pros and cons, and we must choose based on the data and application requirements.
How do you perform web scraping in Python?
- Access the webpage using the requests library and the URL
- Extract tables and information using BeautifulSoup
- Convert them into a structured form using Pandas
- Clean it using Pandas and Numpy
- Save the data in the form of a CSV file
- In some cases, pandas.read_html works wonders: it extracts, processes, and converts tables into a structured format (see the sketch below).
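A minimal sketch of the steps above, assuming the requests, beautifulsoup4, pandas, and lxml packages are available; the URL is hypothetical:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://example.com/some-table-page"   # hypothetical page with an HTML table
html = requests.get(url, timeout=30).text

# Parse the page and pull out the first table
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Hand the table HTML to pandas for structuring, then clean and save it
df = pd.read_html(StringIO(str(table)))[0]
df = df.dropna(how="all")
df.to_csv("scraped_table.csv", index=False)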
Data Engineer Interview Questions on Big Data
Differentiate between relational and non-relational database management systems.
Relational Database Management Systems (RDBMS) | Non-relational Database Management Systems |
Relational Databases primarily work with structured data using SQL (Structured Query Language). SQL works on data arranged in a predefined schema. | Non-relational databases support dynamic schema for unstructured data. Data can be graph-based, column-oriented, document-oriented, or even stored as a Key store. |
RDBMS follow the ACID properties - atomicity, consistency, isolation, and durability. | Non-RDBMS follow Brewer's CAP theorem - consistency, availability, and partition tolerance. |
RDBMS are usually vertically scalable. A single server can handle more load by increasing resources such as RAM, CPU, or SSD. | Non-RDBMS are horizontally scalable and can handle more traffic by adding more servers to handle the data. |
Relational databases are a better option if the data requires multi-row transactions, since relational databases are table-oriented. | Non-relational databases are ideal if you need flexibility in how data is stored, since you can add documents without first defining a fixed schema. Because non-RDBMS are horizontally scalable, they can become more powerful and are suitable for large or constantly changing datasets. |
E.g. PostgreSQL, MySQL, Oracle, Microsoft SQL Server. | E.g. Redis, MongoDB, Cassandra, HBase, Neo4j, CouchDB |
What is data modeling?
Data modeling is a technique that defines and analyzes the data requirements needed to support business processes. It involves creating a visual representation of an entire system of data or a part of it. The process of data modeling begins with stakeholders providing business requirements to the data engineering team.
How is a data warehouse different from an operational database?
Data warehouse | Operational database |
Data warehouses generally support high-volume analytical data processing - OLAP. | Operational databases support high-volume transaction processing, typically - OLTP. |
You may add new data regularly, but once you add the data, it does not change very frequently. | Data is regularly updated. |
Data warehouses are optimized to handle complex queries, which can access multiple rows across many tables. | Operational databases are ideal for queries that return single rows at a time per table. |
There is a large amount of data involved. | The amount of data is usually less. |
A data warehouse is usually suitable for fast retrieval of data from relatively large volumes of data. | Operational databases are optimized to handle fast inserts and updates on a smaller scale of data. |
What are the big four V’s of big data?
Volume: refers to the size of the data sets to be analyzed or processed. The size is generally in terabytes and petabytes.
Velocity: the speed at which data is generated. The data is generated faster than traditional data handling techniques can process it.
Variety: the data can come from various sources and contain structured, semi-structured, or unstructured data.
Veracity: the quality of the data to be analyzed. The data has to be able to contribute in a meaningful way to generate results.
Differentiate between Star schema and Snowflake schema.
Star schema | Snowflake Schema |
Star schema is a simple top-down data warehouse schema that contains the fact tables and the dimension tables. | The snowflake schema is a bottom-up data warehouse schema that contains fact tables, dimension tables, and sub-dimension tables. |
Takes up more space. | Takes up less space. |
Takes less time for query execution. | Takes more time for query execution than star schema. |
Normalization is not useful in a star schema, and there is high data redundancy. | Normalization and denormalization are useful in this data warehouse schema, and there is less data redundancy. |
The design and understanding are simpler than the Snowflake schema, and the Star schema has low query complexity. | The design and understanding are a little more complex. Snowflake schema has higher query complexity than Star schema. |
There are fewer foreign keys. | There are many foreign keys. |
What are the differences between OLTP and OLAP?
OLTP (Online Transaction Processing) Systems | OLAP (Online Analytical Processing ) Systems |
System for modification of online databases. | System for querying online databases. |
Supports insert, update and delete transformations on the database. | Supports extraction of data from the database for further analysis. |
OLTP systems generally run simpler queries that require less processing time. | OLAP systems generally run more complex queries that require more processing time. |
Tables in OLTP are normalized. | Tables in OLAP are typically denormalized. |
What are some differences between a data engineer and a data scientist?
Data engineers and data scientists work very closely together, but there are some differences in their roles and responsibilities.
Data Engineer | Data scientist |
The primary role is to design and implement highly maintainable database management systems. | The primary role of a data scientist is to take the raw data presented to them and apply analytic tools and modeling techniques to analyze it and provide insights to the business. |
Data engineers transform the big data into a structure that one can analyze. | Data scientists perform the actual analysis of Big Data. |
They must ensure that the database infrastructure meets industry requirements and caters to the business. | They must analyze the data and frame problem statements whose solutions help the business. |
Data engineers have to take care of the safety, security and backing up of the data, and they work as gatekeepers of the data. | Data scientists should have good data visualization and communication skills to convey the results of their data analysis to various stakeholders. |
Proficiency in the field of big data, and strong database management skills. | Proficiency in machine learning is a requirement. |
Both data scientist and data engineer roles require professionals with a background in computer science and engineering, or a closely related field such as mathematics, statistics, or economics. A sound command of software and programming languages is important for both roles.
How is a data architect different from a data engineer?
Data architect | Data engineers |
Data architects visualize and conceptualize data frameworks. | Data engineers build and maintain data frameworks. |
Data architects provide the organizational blueprint of data. | Data engineers use the organizational data blueprint to collect, maintain and prepare the required data. |
Data architects require practical skills with data management tools including data modeling, ETL tools, and data warehousing. | Data engineers must possess skills in software engineering and be able to maintain and build database management systems. |
Data architects help the organization understand how changes in data acquisitions will impact the data in use. | Data engineers take the vision of the data architects and use this to build, maintain and process the architecture for further use by other data professionals. |
Differentiate between structured and unstructured data.
Structured Data | Unstructured Data |
Structured data usually fits into a predefined model. | Unstructured data does not fit into a predefined data model. |
Structured data usually consists of text and numbers organized into predefined fields. | Unstructured data can be text, images, audio, video, or other formats. |
It is easy to query structured data and perform further analysis on it. | It is difficult to query the required unstructured data. |
Relational databases and data warehouses contain structured data. | Data lakes and non-relational databases can contain unstructured data. A data warehouse can contain unstructured data too. |
How does Network File System (NFS) differ from Hadoop Distributed File System (HDFS)?
Network File System | Hadoop Distributed File System |
NFS can store and process only small volumes of data. | Hadoop Distributed File System, or HDFS, primarily stores and processes large amounts of data or Big Data. |
The data in an NFS exists in a single dedicated hardware. | The data blocks exist in a distributed format on local hardware drives. |
NFS is not very fault tolerant. In case of a machine failure, you cannot recover the data. | HDFS is fault tolerant and you may recover the data if one of the nodes fails. |
There is no data redundancy as NFS runs on a single machine. | Due to replication across machines on a cluster, there is data redundancy in HDFS. |
What is meant by feature selection?
Feature selection is the process of identifying and selecting only the features relevant to the prediction variable or desired output for model creation. A subset of the features that contribute the most to the desired output is selected either automatically or manually.
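One common way to do this, sketched here with scikit-learn's SelectKBest on a synthetic dataset (the choice of k is arbitrary):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=42)

# Keep the 4 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))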
How can missing values be handled in Big Data?
Some ways to handle missing values in Big Data are as follows (a short pandas sketch follows the list):
Deleting rows with missing values: you simply delete rows or columns with missing values from the dataset. If a column has null values in more than half of its rows, you can drop the entire column from the analysis, and a similar rule can be applied to rows with missing values in more than half of their columns. This method does not work well when a large share of the values are missing.
Using mean/median for missing values: if a column has missing values and its data type is numeric, you can fill in the missing values with the mean or median of the remaining values in the column.
Imputation for categorical data: if the data in a column is categorical, you can replace the missing values with the most frequently occurring category in that column. If more than half of the column's values are missing, you can introduce a new category to represent the missing values.
Predicting missing values: Regression or classification techniques can predict the values based on the nature of the missing values.
Last Observation Carried Forward (LOCF) method: the last valid observation is carried forward to fill in the missing value, for data variables that display longitudinal behavior.
Using algorithms that support missing values: some algorithms, such as the k-NN algorithm, can ignore a column when values are missing; Naive Bayes is another such algorithm. The random forest algorithm also works with non-linear and categorical data.
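A short pandas sketch of the deletion and imputation approaches above; the column names and values are made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["NY", "SF", None, "NY", "SF"],
})

# Drop rows in which every value is missing
df = df.dropna(how="all")

# Fill a numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())

# Fill a categorical column with its most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)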
What is meant by outliers?
In a dataset, an outlier is an observation that lies at an abnormal distance from the other values in a random sample from the data set. It is left to the analyst to determine what counts as abnormal; before classifying data points as abnormal, you must first identify and characterize the normal observations. Outliers may occur due to variability in measurement or a particular experimental error, and they should be identified and handled so that they do not distort further analysis of the data.
What is meant by logistic regression?
Logistic regression is a classification model rather than a regression model; it models the probability of a discrete outcome given an input variable. It is a simple and efficient method for binary and linear classification problems. As a statistical method, it works well for binary classification and can be generalized to multiclass classification.
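A minimal scikit-learn sketch of a binary logistic regression on synthetic data, just to illustrate the idea:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print("Class probabilities for 3 samples:", model.predict_proba(X_test[:3]))
print("Test accuracy:", model.score(X_test, y_test))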
Briefly define the Star Schema.
The star join schema, one of the most basic design schemas in data warehousing, is also known as the star schema. It looks like a star, with a central fact table and related dimension tables. The star schema is useful when handling huge amounts of data.
Briefly define the Snowflake Schema.
The snowflake schema, another popular design schema, is a basic extension of the star schema that adds further dimension tables. The name comes from its resemblance to the structure of a snowflake. In the snowflake schema, the data is organized and, after normalization, split into additional tables.
What is the difference between the KNN and k-means methods?
The k-means method is an unsupervised learning algorithm used for clustering, whereas k-nearest neighbors (KNN) is a supervised learning algorithm for classification and regression problems.
KNN relies on feature similarity to its nearest labeled neighbors, whereas k-means divides data points into clusters so that each data point belongs to exactly one cluster (a short sketch follows below).
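The contrast is easy to see in a small scikit-learn sketch on synthetic data: k-means never sees the labels, while KNN needs them to train:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# k-means: unsupervised, fit on features only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster assignments:", kmeans.labels_[:10])

# KNN: supervised, requires labels for training
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("Predicted classes:", knn.predict(X[:10]))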
What is the purpose of A/B testing?
A/B testing is a randomized experiment performed on two variants, ‘A’ and ‘B’. It is a statistics-based process that applies statistical hypothesis testing, also known as two-sample hypothesis testing. The goal is to compare subjects’ responses to variant A against their responses to variant B to determine which variant is more effective at achieving a particular outcome.
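For example, a two-sample t-test on simulated metric values for the two variants can be run with SciPy; the numbers below are synthetic:
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
variant_a = rng.normal(loc=0.10, scale=0.05, size=1000)   # metric under variant A
variant_b = rng.normal(loc=0.12, scale=0.05, size=1000)   # metric under variant B

t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the observed difference is unlikely to be due to chance alone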
What do you mean by collaborative filtering?
Collaborative filtering is a method used by recommendation engines. In the narrow sense, collaborative filtering is a technique for automatically predicting a user's tastes by collecting information about the interests or preferences of many other users. It works on the logic that if person 1 and person 2 share an opinion on one issue, person 1 is more likely to share person 2's opinion on another issue than a randomly chosen person would be. More generally, collaborative filtering is the process of filtering information using techniques that involve collaboration among multiple data sources and viewpoints.
What are some biases that can happen while sampling?
Some common types of bias that occur while sampling are:
Undercoverage- The undercoverage bias occurs when there is an inadequate representation of some members of a particular population in the sample.
Observer Bias- Observer bias occurs when researchers unintentionally project their expectations on the research. There may be occurrences where the researcher unintentionally influences surveys or interviews.
Self-Selection Bias- Self-selection bias, also known as volunteer response bias, happens when the research study participants take control over the decision to participate in the survey. The individuals may be biased and are likely to share some opinions that are different from those who choose not to participate. In such cases, the survey will not represent the entire population.
Survivorship Bias- The survivorship bias occurs when a sample concentrates on subjects that passed the selection process or criterion and ignores those that did not. Survivorship bias can lead to overly optimistic results.
Recall Bias- Recall bias occurs when a respondent fails to remember things correctly.
Exclusion Bias- The exclusion bias occurs due to the exclusion of certain groups while building the sample.
What is a distributed cache?
A distributed cache pools the RAM in multiple computers networked into a single in-memory data store to provide fast access to data. Most traditional caches tend to be in a single physical server or hardware component. Distributed caches, however, grow beyond the memory limits of a single computer as they link multiple computers, providing larger and more efficient processing power. Distributed caches are useful in environments that involve large data loads and volumes. They allow scaling by adding more computers to the cluster and allowing the cache to grow based on requirements.
Explain how Big Data and Hadoop are related to each other.
Apache Hadoop is a collection of open-source libraries for processing large amounts of data. Hadoop supports distributed computing, in which data is processed across clusters of multiple computers. Previously, an organization that had to process large volumes of data needed to buy expensive hardware; Hadoop shifted that dependency away from hardware, achieving high performance, reliability, and fault tolerance through the software itself. Hadoop is useful wherever there is Big Data to store and insights to be generated from it. It also has robust community support and keeps evolving to process, manage, manipulate, and visualize Big Data in new ways.
Briefly define COSHH.
COSHH is an acronym for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems. As the name implies, it offers scheduling at both the cluster and application levels to speed up job completion.
Give a brief overview of the major Hadoop components.
Working with Hadoop involves many different components, some of which are listed below:
Hadoop Common: This comprises all the tools and libraries typically used by the Hadoop application.
Hadoop Distributed File System (HDFS): When using Hadoop, all data is present in the HDFS, or Hadoop Distributed File System. It offers an extremely high bandwidth distributed file system.
Hadoop YARN: The Hadoop system uses YARN, or Yet Another Resource Negotiator, to manage resources. YARN can also be useful for task scheduling.
Hadoop MapReduce: Hadoop MapReduce is a framework that gives users access to large-scale parallel data processing.
List some of the essential features of Hadoop.
Hadoop is a user-friendly open source framework.
Hadoop is highly scalable and can handle any kind of dataset effectively: structured (e.g., relational/MySQL data), semi-structured (e.g., XML, JSON), and unstructured (e.g., images and videos).
Parallel computing ensures efficient data processing in Hadoop.
Hadoop ensures data availability even if one of your systems crashes by copying data across several DataNodes in a Hadoop cluster.
What methods does a Reducer use in Hadoop?
The three primary methods of a Reducer in Hadoop are as follows:
setup(): This function is mostly useful to set input data variables and cache protocols.
cleanup(): This procedure is useful for deleting temporary files saved.
reduce(): This method is used only once for each key and is the most crucial component of the entire reducer.
What are the various design schemas in data modeling?
There are two fundamental design schemas in data modeling: star schema and snowflake schema.
Star Schema- The star schema is the most basic type of data warehouse schema. Its structure is similar to that of a star, where the star's center may contain a single fact table and several associated dimension tables. The star schema is efficient for data modeling tasks such as analyzing large data sets.
Snowflake Schema- The snowflake schema is an extension of the star schema. In terms of structure, it adds more dimensions and has a snowflake-like appearance. Data is split into additional tables, and the dimension tables are normalized.
What are the components that the Hive data model has to offer?
The major components of the Hive data model are tables, partitions, and buckets.
Data Engineer Interview Questions on Python
Python is crucial for implementing data engineering techniques. Pandas, NumPy, NLTK, SciPy, and other Python libraries support a range of data engineering tasks, from faster data processing to machine learning. Data engineers primarily focus on data modeling and data processing architecture, but they also need a fundamental understanding of algorithms and data structures. Take a look at some data engineer interview questions based on various Python concepts, including Python libraries such as Pandas, NumPy, and SciPy, as well as algorithms and data structures.
Differentiate between *args and **kwargs.
*args in a function definition lets a caller pass a variable number of positional arguments; the single star collects those extra positional arguments into a tuple inside the function.
**kwargs in a function definition lets a caller pass a variable number of keyword arguments; the double star collects them into a dictionary inside the function.
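A tiny example showing how the two are collected inside the function:
def describe(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    print("positional:", args)
    print("keyword:", kwargs)

describe(1, 2, 3, name="pipeline", retries=2)
# positional: (1, 2, 3)
# keyword: {'name': 'pipeline', 'retries': 2}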
What is the difference between “is” and “==”?
Python's “is” operator checks whether two variables point to the same object, while “==” checks whether the values of the two variables are equal.
E.g. consider the following code:
a = [1,2,3]
b = [1,2,3]
c = b
a == b
evaluates to True since the values contained in lists a and b are the same, but
a is b
evaluates to False since a and b refer to two different objects.
c is b
evaluates to True since c and b point to the same object.
How is memory managed in Python?
Memory in Python exists in the following way:
The objects and data structures initialized in a Python program are present in a private heap, and programmers do not have permission to access the private heap space.
You can allocate heap space for Python objects using the Python memory manager. The core API of the memory manager gives the programmer access to some of the tools for coding purposes.
Python has a built-in garbage collector that recycles unused memory and frees up memory for heap space.
What is a decorator?
A decorator is a tool in Python that lets programmers wrap another function around a function or class to extend the behavior of the wrapped function without permanently modifying it. Functions in Python are first-class objects, meaning they can be passed around and used as arguments; in a decorator, a function is passed as the argument to another function and then called inside the wrapper function.
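A small illustrative decorator that times the wrapped function; the decorated function here is just an example:
import functools
import time

def timed(func):
    # Wrap func so that each call reports how long it took
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def load_data(n):
    return list(range(n))

load_data(1_000_000)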
Are lookups faster with dictionaries or lists in Python?
The time complexity of looking up a value in a Python list is O(n), since the whole list may have to be iterated to find the value. Because a dictionary is a hash table, the time complexity of finding the value associated with a key is O(1). Hence, lookups are generally faster with a dictionary, with the limitation that dictionary keys must be unique and hashable.
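A quick, rough way to see the difference is to time membership checks with the standard timeit module; absolute numbers will vary by machine:
import timeit

data_list = list(range(100_000))
data_dict = dict.fromkeys(data_list)

# Membership check: O(n) scan for the list, O(1) hash lookup for the dict
print("list:", timeit.timeit(lambda: 99_999 in data_list, number=1_000))
print("dict:", timeit.timeit(lambda: 99_999 in data_dict, number=1_000))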
How can you return the binary of an integer?
The bin() function takes an integer and returns its binary representation as a string; for example, bin(10) returns '0b1010'.
How can you remove duplicates from a list in Python?
A list can be converted into a set and then back into a list to remove the duplicates. Sets do not contain duplicate data in Python.
E.g.
list1 = [5,9,4,8,5,3,7,3,9]
list2 = list(set(list1))
list2 will contain the unique values 3, 4, 5, 7, 8, and 9 (the exact order is not guaranteed).
Note that set() does not preserve the original order of the items; if order matters, list(dict.fromkeys(list1)) keeps the first occurrence of each item.
What is the difference between append and extend in Python?
The argument passed to append() is added as a single element to a list in Python. The list length increases by one, and the time complexity for append is O(1).
The argument passed to extend() is iterated over, and each element of the argument adds to the list. The length of the list increases by the number of elements in the argument passed to extend(). The time complexity for extend is O(n), where n is the number of elements in the argument passed to extend.
Consider:
list1 = ["Python", "data", "engineering"]
list2 = ["projectpro", "interview", "questions"]
list1.append(list2)
list1 will now be: ["Python", "data", "engineering", ["projectpro", "interview", "questions"]]
The length of list1 is 4.
If extend is used instead of append on the original list1:
list1.extend(list2)
list1 will now be: ["Python", "data", "engineering", "projectpro", "interview", "questions"]
The length of list1, in this case, becomes 6.
When do you use pass, continue and break?
The break statement in Python terminates the loop (or other enclosing statement) that contains it. If a break statement is inside a nested loop, it terminates only the innermost loop containing it, and control passes to the statement immediately following that loop.
The continue statement forces control to skip the rest of the current iteration of the loop and move on to the next iteration, rather than terminating the loop completely.
The pass statement does nothing when it executes; it is useful when a statement is syntactically required but no code needs to run, for example in empty loops, control statements, functions, and classes.
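A compact loop showing all three statements together:
for n in range(10):
    if n == 0:
        pass        # placeholder: nothing to do, execution falls through
    if n % 2 == 0:
        continue    # skip the rest of this iteration for even numbers
    if n > 7:
        break       # leave the loop entirely once n exceeds 7
    print(n)        # prints 1, 3, 5, 7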
How can you check if a given string contains only letters and numbers?
The str.isalnum() method checks whether a string contains only letters and numbers; for example, "abc123".isalnum() returns True, while "abc 123".isalnum() returns False.
Mention some advantages of using NumPy arrays over Python lists.
NumPy arrays take up less space in memory than lists.
NumPy arrays are faster than lists.
NumPy arrays have built-in functions optimized for various techniques such as linear algebra, vector, and matrix operations.
Lists in Python do not support element-wise operations directly, but NumPy arrays do (see the short example below).
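A short example of the element-wise behavior:
import numpy as np

prices = np.array([10.0, 12.5, 9.9])
quantities = np.array([3, 2, 5])

# Arrays multiply element-wise directly
print(prices * quantities)                       # [30.   25.   49.5]

# Plain lists need an explicit loop or comprehension
print([p * q for p, q in zip(prices.tolist(), quantities.tolist())])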
In Pandas, how can you create a dataframe from a list?
import pandas as pd
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday']
# Calling DataFrame constructor on list
df = pd.DataFrame(days)
df is the data frame created from the list ‘days’.
df = pd.DataFrame(days, index=['1', '2', '3', '4'], columns=['Days'])
can be used to create the data frame while also specifying the index labels and the column name.
In Pandas, how can you find the median value in a column “Age” from a dataframe “employees”?
The median() function can be used to find the median value in a column, e.g. employees["Age"].median().
In Pandas, how can you rename a column?
The rename() function can be used to rename columns of a data frame.
To rename address_line_1 to 'region' and address_line_2 to 'city' (rename returns a new data frame, so assign the result):
employees = employees.rename(columns={'address_line_1': 'region', 'address_line_2': 'city'})
How can you identify missing values in a data frame?
The isnull() function helps identify missing values in a given data frame.
The syntax is DataFrame.isnull().
It returns a dataframe of boolean values with the same shape as the original data frame, where missing values are mapped to True and non-missing values to False.
What is SciPy?
SciPy is an open-source Python library that is useful for scientific computations. SciPy is short for Scientific Python and is used to solve complex mathematical and scientific problems. SciPy is built on top of NumPy and provides effective, user-friendly functions for numerical optimization. The SciPy library comes equipped with functions to support integration, ordinary differential equation solvers, special functions, and support for several other technical computing functions.
Given a 5x5 matrix in NumPy, how will you inverse the matrix?
The function numpy.linalg.inv() can be used to invert a matrix. It takes a matrix as input and returns its inverse, which exists only when the determinant is non-zero (mathematically, M^-1 = adjoint(M)/determinant(M) when det(M) != 0):
import numpy as np

M = np.random.rand(5, 5)
if np.linalg.det(M) != 0:
    M_inv = np.linalg.inv(M)
else:
    print("Inverse does not exist")
What is an ndarray in NumPy?
In NumPy, an array is a table of elements that are all of the same type, indexed by a tuple of positive integers. To create an array in NumPy, you create an n-dimensional array object. An ndarray is the n-dimensional array object defined in NumPy, which stores a collection of elements of the same data type.
Using NumPy, create a 2-D array of random integers between 0 and 500 with 4 rows and 7 columns.
from numpy import random
x = random.randint(500, size=(4, 7))
Find all the indices in an array of NumPy where the value is greater than 5.
import numpy as np
array = np.array([5,9,6,3,2,1,9])
To find the indices of values greater than 5:
print(np.where(array > 5))
This gives the output (array([1, 2, 6]),) since the values greater than 5 are located at indices 1, 2, and 6.