Data Analyst interview questions and answers for 2025
Data Analyst Interview Questions for Freshers and Intermediate Levels
What is the role of a Data Analyst?
A Data Analyst is responsible for cleaning, analyzing, and interpreting raw data to derive meaningful insights that help guide business decisions.
They often work with various data sources, use tools like SQL, Excel, Python, or BI platforms (e.g., Tableau, Power BI) to identify patterns, trends, and correlations, and then communicate these findings to stakeholders through reports, dashboards, and presentations.
What is the difference between structured and unstructured data?
- Structured Data: Organized into predefined formats like tables, with rows and columns (e.g., relational databases). It’s easily searchable and manageable using SQL queries.
- Unstructured Data: Lacks a fixed schema or format (e.g., text files, images, audio, emails). It requires more complex methods (e.g., text mining, natural language processing) for analysis.
What is the difference between ETL and ELT? What are their pros and cons?
- ETL (Extract, Transform, Load): Data is extracted from source systems, transformed into the desired format, and then loaded into the destination system (e.g., data warehouse).
- Pros: Better control over data quality; data is cleaned and transformed before loading.
- Cons: Slower processing; may require more resources during the transformation phase.
- ELT (Extract, Load, Transform): Data is extracted, loaded into the destination system, and then transformed as needed.
- Pros: Faster, especially with cloud-based data warehouses; leverages the power of modern databases for transformation.
- Cons: Can result in unclean data in the warehouse; more reliance on the destination system for transformations.
How would you handle missing data in a column that contains an excessive number of outliers?
To handle missing data in a column with excessive outliers, I would:
- Analyze the distribution: Understand the nature of the outliers and the missing data—whether they’re random or represent a pattern.
- Use robust imputation methods: Employ techniques like median imputation or model-based approaches (e.g., k-NN, regression) that are less sensitive to outliers.
- Consider transformation: Apply a log transformation or winsorization to minimize the effect of outliers.
- Remove or cap extreme values: If the outliers are errors, consider removing them or capping values to a reasonable range before imputation.
What is the difference between a primary key and a unique key in relational databases?
A primary key uniquely identifies each record in a table and cannot contain NULL values. There can only be one primary key per table.
A unique key also ensures the uniqueness of values in a column but allows for NULL values, and a table can have multiple unique keys.
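Example (an illustrative sketch; the table and column names are hypothetical):
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,        -- exactly one primary key per table; NULLs not allowed
    email       VARCHAR(255) UNIQUE,    -- unique key; NULLs allowed (exact behavior varies by dialect)
    national_id VARCHAR(20) UNIQUE      -- a table may carry several unique keys
);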
Explain the concept of normalization in databases.
Normalization is the process of organizing data to reduce redundancy and improve data integrity. Common forms like 1st Normal Form (1NF), 2nd Normal Form (2NF), and 3rd Normal Form (3NF) aim to:
- Ensure each table has a primary key.
- Remove partial dependencies (2NF).
- Remove transitive dependencies (3NF).
Ultimately, normalization improves consistency and reduces anomalies (insertion, update, deletion).
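For instance (a hedged sketch with hypothetical tables), an orders table that repeats customer name and city on every row can be split so each customer fact is stored once and referenced by key:
-- Before: orders(order_id, customer_name, customer_city, order_date, amount)
-- After (3NF): customer attributes live in their own table
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_city VARCHAR(100)
);
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date  DATE,
    amount      DECIMAL(10, 2)
);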
How would you approach combining data from multiple sources with different structures in SQL, ensuring data integrity and consistency?
To combine data from multiple sources with different structures in SQL, I would use the following approach:
- Identify common keys: First, find the common columns (keys) that can be used for joining tables.
- Use appropriate JOINs: Depending on the use case, I’d use INNER JOIN (to fetch matching records), LEFT JOIN (to keep all records from the left table), RIGHT JOIN, or FULL OUTER JOIN to combine data from different sources.
- Data cleaning and transformations: Ensure the data types match and handle any missing or inconsistent data before joining.
- Validation: After the join, validate the results to check for data integrity, ensuring that no records are unintentionally excluded or duplicated.
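A minimal sketch of this approach (the source tables crm_orders and web_customers and their columns are hypothetical), casting the shared key to a common type before joining:
SELECT o.order_id,
       o.amount,
       c.customer_name
FROM crm_orders o
INNER JOIN web_customers c
    -- align data types on the common key before joining
    ON CAST(o.customer_ref AS VARCHAR(20)) = CAST(c.customer_id AS VARCHAR(20));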
Explain the difference between INNER JOIN and LEFT JOIN.
- INNER JOIN: Returns only rows that have matching values in both tables.
- LEFT JOIN: Returns all rows from the left table and matched rows from the right table. If there is no match, the result will contain NULL for columns from the right table.
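Example (hypothetical orders and customers tables):
-- INNER JOIN: only orders that have a matching customer
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT JOIN: every order; customer_name is NULL where no match exists
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;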
What is a subquery in SQL?
A subquery is a query nested inside another SQL query (SELECT, INSERT, UPDATE, or DELETE). It can be used to filter results based on conditions derived from another result set.
Example:
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
What is the difference between OLTP and OLAP?
- OLTP (Online Transaction Processing): Handles day-to-day transactional operations. Databases are normalized, allowing fast reads/writes of small amounts of data.
- OLAP (Online Analytical Processing): Used for analysis and decision-making. Databases are often denormalized (star/snowflake schemas) for fast querying and aggregations over large volumes of data.
How do you detect and handle outliers in a dataset?
- Detection: Use statistical methods such as the Interquartile Range (IQR), Z-scores, or visualization techniques like box plots.
- Handling: Could involve removing outliers if they are errors, applying transformations (e.g., log transformation), or using robust statistical methods that are less sensitive to extreme values. Additionally, apply domain knowledge to understand why the outliers occur, which helps determine the appropriate way to address them.
What is a Pivot Table, and how is it useful?
A Pivot Table (often in Excel or BI tools) is a feature that summarizes large datasets by grouping and aggregating values. It helps quickly generate insights like sums, averages, counts, and perform comparisons across categories. Analysts use pivot tables for quick exploratory data analysis and reporting.
How would you explain the concept of correlation to a non-technical stakeholder?
Correlation measures how two variables move in relation to each other. If one variable tends to increase when the other increases, they have a positive correlation.
If one increases while the other decreases, they have a negative correlation. If there’s no pattern, correlation is near zero. It doesn’t imply causation, just a relationship pattern.
What is the difference between COUNT(*) and COUNT(column_name) in SQL?
- COUNT(*) counts all rows in a result set, including those with NULL values.
- COUNT(column_name) counts only the non-NULL values in that column.
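Example (hypothetical employees table in which some rows have a NULL manager_id):
SELECT COUNT(*)          AS total_rows,    -- every row, regardless of NULLs
       COUNT(manager_id) AS with_manager   -- only rows where manager_id IS NOT NULL
FROM employees;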
Can you explain the difference between a WHERE clause and a HAVING clause in SQL?
- WHERE Clause: Filters rows before aggregation (GROUP BY). It cannot use aggregate functions directly.
- HAVING Clause: Filters groups after aggregation. It can use aggregate functions since it acts on grouped data.
- QUALIFY Clause (Extra Point): Used to filter the results of window functions. It operates similarly to WHERE and HAVING but specifically applies to the output of window functions, making it useful in scenarios involving analytical queries.
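Example (hypothetical employees table):
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'   -- WHERE filters individual rows before grouping
GROUP BY department
HAVING AVG(salary) > 60000;       -- HAVING filters groups using an aggregate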
What is a CTE (Common Table Expression) in SQL, and why is it useful?
A CTE is a temporary named result set defined within the execution scope of a single SQL query. It improves query readability and can be referenced multiple times within the same query.
Example:
WITH RegionalSales AS (
SELECT region, SUM(sales) AS total_sales
FROM transactions
GROUP BY region
)
SELECT region, total_sales
FROM RegionalSales
WHERE total_sales > 100000;
How would you calculate the median in SQL?
SQL doesn’t have a built-in MEDIAN function in all dialects, so you often simulate it using window functions or percentile functions where available (e.g., PERCENTILE_CONT in PostgreSQL or Oracle):
Example (PostgreSQL):
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median_val
FROM table_name;
If not available, you might have to combine multiple queries or use a window function and carefully select the “middle” row(s).
How do you ensure data quality before analysis?
- Data Validation: Check for inconsistencies, duplicates, missing values.
- Data Profiling: Understand distributions, data types, and identify anomalies.
- Business Rules: Verify data against expected ranges and domain-specific rules.
- Automation of Checks: Use scripts to perform repetitive validation steps.
What tools do you use for data visualization?
Common data visualization tools include:
- BI Tools: Tableau, Power BI, QlikView
- Programming Libraries: matplotlib, seaborn, plotly in Python; ggplot2 in R
- Web-based Tools: D3.js for custom visualizations
Explain the concept of data cleansing.
Data cleansing (or cleaning) involves fixing or removing incorrect, corrupted, incomplete, or duplicate data within a dataset.
Methods include removing duplicates, correcting spelling errors, formatting dates, dealing with missing values, and ensuring consistency across the dataset. Clean data leads to more accurate and reliable analysis.
How would you convert data from long format to wide format in a tool like Excel or Python?
In Excel: Use Pivot Tables to pivot data from a long (tall) format into a wide format with categories as columns.
In Python (pandas):
import pandas as pd
df = pd.read_csv("long_data.csv")
df_wide = df.pivot(index='id', columns='category', values='value')
What is the difference between classification and regression?
- Classification: Predicts discrete categories (e.g., yes/no, class labels).
- Regression: Predicts continuous numerical values (e.g., sales amount, price).
A Data Analyst may not build predictive models often, but understanding these concepts is useful when working alongside Data Scientists.
How do you optimize a slow SQL query?
- Check Indexes: Ensure appropriate indexing on JOIN columns and filters.
- Rewrite Queries: Simplify complex subqueries, use CTEs for clarity, remove unnecessary functions in WHERE clauses.
- Analyze Execution Plan: Identify bottlenecks, such as table scans where indexes could help.
- Use Proper Joins: If possible, reduce the size of intermediate result sets by applying filters early.
What is data aggregation?
Data aggregation is the process of summarizing detailed data into a more compact form. Common aggregations include SUM, AVERAGE, COUNT, MIN, and MAX. This helps in understanding data at a higher level and supports decision-making.
How would you remove duplicate rows from a dataset using SQL?
A common way is to use a CTE with a window function (if your SQL flavor supports it):
WITH Ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY unique_identifier ORDER BY id) AS rn
FROM table_name
)
DELETE FROM Ranked
WHERE rn > 1;
Note that deleting through a CTE as shown is supported in SQL Server; other databases typically need a different pattern (e.g., deleting by a row identifier selected from the CTE). Alternatively, you can use SELECT DISTINCT in queries where duplicates only need to be excluded from the result rather than removed from the table.
How do you calculate the percentage contribution of each category to the total?
Use a window function or a subquery to find the sum, then divide each category’s sum by the total:
SELECT
category,
SUM(value) AS category_sum,
SUM(value)*100.0 / (SELECT SUM(value) FROM sales) AS percentage
FROM sales
GROUP BY category;
Explain the concept of a data warehouse.
A data warehouse is a centralized repository for integrated, historical data from multiple sources. It’s designed for query and analysis rather than transaction processing.
It supports OLAP queries, enabling business users to perform complex analyses, historical trending, and reporting.
What is the purpose of using GROUP BY in SQL?
GROUP BY is used to aggregate data across rows that share a common value in one or more columns. It allows you to perform aggregate calculations (SUM, COUNT, AVG) on subsets of data grouped by the specified columns.
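Example (hypothetical orders table):
SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM orders
GROUP BY region;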
How do you interpret a box plot?
A box plot summarizes the distribution of a dataset:
- The box shows the interquartile range (IQR: Q1 to Q3).
- The line inside the box shows the median.
- Whiskers typically extend to Q1 − 1.5×IQR and Q3 + 1.5×IQR. Points outside the whiskers are considered outliers.
How would you explain the importance of data storytelling?
Data storytelling involves presenting analytical insights in a clear, compelling narrative.
It helps non-technical stakeholders understand complex findings, provides context around numbers, and guides decision-making. Good data storytelling transforms raw data into meaningful, actionable insights.
Data Analyst Interview Questions for Experienced Levels
Describe your experience with dimensional modeling and designing star schemas.
Dimensional modeling organizes data into fact tables and dimension tables to facilitate OLAP queries.
A star schema has a central fact table (containing measurable data like sales or revenue) linked to dimension tables (like date, product, region) with foreign keys.
This structure reduces complexity, improves query performance, and is widely used in data warehouses.
How do you ensure data governance and data quality at scale?
Implementing robust data governance involves defining policies, standards, and processes for data management. This includes:
- Data Catalogs & Dictionaries: Documenting data definitions and metadata.
- Data Quality Checks: Automated validation rules, anomaly detection, and data quality dashboards.
- Access Controls & Compliance: Ensuring correct permissions, encryption, and adherence to regulations (e.g., GDPR).
Explain the use of window functions in SQL for advanced analytics.
Window functions perform calculations across sets of rows related to the current row without collapsing them into a single output row. They enable advanced analytics like running totals, moving averages, and ranking without the need for complex subqueries.
Example:
SELECT
date,
sales,
SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales_data;
How do you implement incremental data loading in a data warehouse?
Incremental loading involves loading only new or changed data since the last load. Techniques include:
- Timestamps or Row Versioning: Load rows with updated timestamps beyond the last load time.
- Change Data Capture (CDC): Track changes in source systems and apply only deltas.
- Comparison between last snapshot and current dataset: Insert new or updated rows, and optionally mark deleted rows.
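A minimal watermark-based sketch (the staging and warehouse table names are hypothetical; in practice updates are usually applied with MERGE/UPSERT rather than a plain INSERT):
-- Load only rows changed since the most recent load already in the warehouse
INSERT INTO dw_orders
SELECT s.*
FROM staging_orders s
WHERE s.updated_at > (
    -- COALESCE handles the very first load, when the warehouse table is empty
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01 00:00:00') FROM dw_orders
);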
Describe a situation where you had to optimize a complex SQL query in a production environment.
Typical steps:
- Analyze the Execution Plan to identify bottlenecks (full table scans, large sorts).
- Add or modify Indexes on frequently filtered columns.
- Simplify or refactor complex joins, reduce unnecessary subqueries.
- Use CTEs or temporary tables to break down the problem.
- Validate improvements using metrics like query run time before and after changes.
How do you handle data lineage and metadata management?
Data lineage tracks the origin, movement, and transformations of data over time. Metadata management ensures understanding of data definitions and transformations. Tools and strategies include:
- Automated lineage tracking tools (e.g., Collibra, Alation).
- Version control for ETL scripts.
- Detailed documentation in a data catalog.
Explain the concept of slowly changing dimensions (SCD) in data warehousing.
SCDs manage historical attribute changes in dimension tables:
- Type 1: Overwrite old values with new (no history).
- Type 2: Keep a record of all changes with start and end dates (tracks history).
- Type 3: Store previous value in a separate column (limited history).
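A hedged sketch of a Type 2 change (hypothetical dim_customer table and values): the currently active row is closed out and a new row is inserted with a fresh validity period.
-- Close the currently active version of the customer that changed
UPDATE dim_customer
SET end_date = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

-- Insert the new version with an open-ended validity period
INSERT INTO dim_customer (customer_id, city, start_date, end_date, is_current)
VALUES (42, 'Berlin', CURRENT_DATE, NULL, TRUE);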
How would you approach designing a data model for a large-scale analytics platform?
- Identify key business processes to model as facts (e.g., sales, transactions).
- Determine relevant dimensions (e.g., product, time, customer) and attributes.
- Apply dimensional modeling best practices (star/snowflake schemas).
- Ensure scalability, consider partitioning strategies, and indexing.
- Integrate data governance, security, and compliance requirements.
Have you worked with NoSQL databases? When would you choose NoSQL over SQL?
Yes, NoSQL databases (e.g., MongoDB, Cassandra) are chosen for:
- Schema-less or rapidly evolving data structures.
- High write throughput and horizontal scalability.
- Storing unstructured or semi-structured data like JSON.
- Use cases where complex joins are less common and flexible schema is preferred.
What is a data lake, and how is it different from a data warehouse?
A data lake is a centralized repository storing raw, unstructured, and structured data in its native format until needed. It’s highly scalable and cost-effective for big data.
A data warehouse, on the other hand, stores cleaned, structured, and curated data optimized for analytics and reporting.
Data lakes are more flexible; warehouses are more performance-tuned for structured queries.
How do you integrate machine learning predictions into your analytics workflow?
- Data Preparation: Ensure clean, feature-engineered data.
- Model Integration: Use APIs or scheduled batch jobs to generate predictions from ML models.
- Scoring and Post-Processing: Store model predictions alongside historical data for trend analysis.
- Visualization: Incorporate predicted values in dashboards, show forecasted vs. actual outcomes.
Explain the importance of index design and maintenance in large databases.
Indexes speed up data retrieval but can slow down writes and consume storage. Good index design:
- Creates indexes on frequently filtered columns.
- Avoids redundant or overly large indexes.
- Periodically rebuilds or reorganizes indexes to maintain performance.
- Balances read vs. write query performance requirements.
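For illustration (table and column names are hypothetical; EXPLAIN syntax varies by database), a composite index supporting a frequent filter pattern:
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date);

-- Check whether the optimizer actually uses the index
EXPLAIN
SELECT order_id, amount
FROM orders
WHERE customer_id = 42
  AND order_date >= '2024-01-01';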
How would you manage and analyze streaming data?
- Use frameworks like Apache Kafka or Kinesis for data ingestion.
- Process data in real-time with tools like Apache Spark Structured Streaming or Flink.
- Store summaries in fast analytics stores for real-time dashboards.
- Apply windowed aggregations, filtering, and anomaly detection on streaming data.
Describe a challenging data cleaning process you conducted.
Example scenario: Consolidating data from multiple CRMs with different schemas. Steps taken:
- Profiling each source to identify data type mismatches, inconsistent encoding, and missing fields.
- Implementing standardized reference tables for country codes and product categories.
- Applying fuzzy matching to unify customer records.
- Creating automated scripts in Python to ensure repeatability and logging.
How do you handle compliance (e.g., GDPR) and sensitive data in your analysis?
- Data Masking/Anonymization: Remove or hash personally identifiable information.
- Least Privilege Access Controls: Grant only necessary permissions.
- Compliance Checks: Validate that data use adheres to GDPR or other regulations.
- Audit Trails & Encryption: Track data access and ensure secure data transmission.
How would you perform an A/B test analysis?
- Define hypothesis and metrics (e.g., conversion rate).
- Split traffic into control (A) and treatment (B) groups randomly.
- Collect data over a fixed period.
- Use statistical methods (t-tests or confidence intervals) to determine if differences are significant.
- Check for sample size adequacy, ensure no bias, and interpret results to guide decisions.
Discuss your experience with advanced Excel techniques (Power Query, Power Pivot, VBA).
- Power Query: For ETL tasks within Excel, automating data cleaning and transformations.
- Power Pivot: Create data models, relationships, and DAX measures for advanced analytics.
- VBA Macros: Automate repetitive tasks, create custom functions, and integrate with other Office applications.
How do you monitor and improve the performance of your dashboards?
- Optimize query performance behind visuals (indexes, pre-aggregations).
- Limit the complexity of visualizations and avoid overly large datasets in a single dashboard.
- Cache frequently accessed data.
- Use performance monitoring tools (e.g., built-in Tableau/Power BI performance analyzers).
Describe a scenario where you implemented data-driven decision-making and its business impact.
For example, analyzing customer churn data:
- Identified segments with high churn using descriptive statistics and trend analysis.
- Proposed targeted retention campaigns for these segments.
- Measured campaign effectiveness and reduced churn by a measurable percentage, increasing overall customer lifetime value.
How do you incorporate automation into your data analysis workflows?
- Schedule ETL jobs using tools like Airflow, Luigi, or cron jobs.
- Write Python or R scripts that run automatically to clean data and produce weekly reports.
- Leverage APIs for continuous data ingestion.
- Implement CI/CD pipelines for analytics code to ensure version control and reproducibility.
Explain data segmentation and its value in analytics.
Data segmentation is dividing a larger dataset into smaller groups (segments) based on certain characteristics (e.g., demographics, behavior).
It helps identify patterns, customize marketing strategies, tailor product recommendations, and improve overall decision-making by focusing on homogeneous groups.
How do you leverage SQL analytical functions for complex reporting?
Use analytic (window) functions like ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and NTILE() to generate rankings, running totals, period-over-period comparisons, and percentile calculations directly in SQL, reducing the need for post-processing in another tool.
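Example (hypothetical monthly_revenue table), using LAG() for a period-over-period comparison:
SELECT region,
       month,
       revenue,
       LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month_revenue,
       revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS mom_change
FROM monthly_revenue;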
Describe your experience working with cloud-based data platforms (e.g., AWS, GCP, Azure).
- AWS: Experience with Redshift for warehousing, S3 for data lake storage, Glue for ETL, and QuickSight for BI.
- GCP: BigQuery for analytics at scale, Dataflow for streaming pipelines, Data Studio for reporting.
- Azure: Azure Synapse for integrated analytics, Azure Data Lake for storage, and Power BI for visualization.
How do you handle version control for SQL queries, dashboards, and analytical scripts?
- Use Git or other VCS to store SQL queries, Python scripts, and documentation.
- Maintain branches for development, staging, and production.
- Use tagging and releases for stable versions.
- Store BI dashboards’ metadata and configuration in code form if tools support it, or use export/import features.
What is the importance of caching in data analytics solutions?
Caching reduces query latency by storing frequently accessed results. Instead of repeatedly performing complex calculations or queries, you can serve data from a fast in-memory cache.
This improves user experience on dashboards and scales performance under high concurrent loads.
Explain how you would implement KPI dashboards to track business performance.
- Identify critical KPIs aligned with business objectives (e.g., monthly revenue, churn rate).
- Define data sources and ensure data quality.
- Build a star schema or use a semantic layer for consistent metric definitions.
- Use BI tools to visualize KPIs with appropriate charts, tables, and filters.
- Add trends, benchmarks, and alerts to prompt action when KPIs deviate from targets.
How do you approach time series forecasting and analysis?
- Explore historical data patterns: seasonality, trends, and cyclic behavior.
- Use statistical models (ARIMA, ETS) or machine learning models (Prophet, LSTM) depending on complexity.
- Validate models with backtesting and evaluate forecast accuracy using error metrics (MAE, RMSE).
- Incorporate external factors if needed, such as holidays or promotions.
Describe your strategy for conducting root cause analysis when a KPI drastically changes.
- Start by verifying data accuracy and ruling out data errors.
- Break down KPI by dimensions (e.g., product, region) to isolate the issue.
- Investigate related metrics to identify correlations or triggers.
- Conduct stakeholder interviews for contextual understanding.
- Document findings and propose actionable interventions.
How do you use advanced analytics techniques (clustering, PCA) in your daily work?
- Clustering: Segment customers or products into groups for targeted marketing or inventory strategies.
- PCA (Principal Component Analysis): Reduce dimensionality of a high-feature dataset while retaining most variance, simplifying analysis and visualization.
- Integrate these techniques using Python libraries (scikit-learn) and feed results into dashboards.
How do you stay updated with the latest tools, technologies, and best practices in data analytics?
- Regularly follow industry blogs, newsletters, and thought leaders (e.g., dbt Labs blog, Towards Data Science).
- Participate in webinars, online courses, and workshops.
- Engage in community forums (Stack Overflow, Reddit’s r/dataengineering).
- Experiment with new tools and libraries in personal projects.
Practical, code-based tasks
Data Cleaning: Remove Duplicates and Handle Missing Values
Task: You are given a dataset with duplicates and missing values. Remove duplicate rows and impute missing values with the median for numeric columns.
Code & Explanation:
import pandas as pd
import numpy as np
# Create example data
df = pd.DataFrame({
'id': [1, 2, 2, 3, 4, 4, 5],
'value': [10, np.nan, np.nan, 15, 10, 10, 20],
'category': ['A', 'B', 'B', 'C', 'C', 'C', 'A']
})
# Remove exact duplicate rows
df = df.drop_duplicates()
# Impute missing numeric values with median
for col in df.select_dtypes(include=[np.number]).columns:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)  # assignment avoids chained-assignment issues with inplace=True
print(df)
Explanation:
- drop_duplicates() removes identical rows.
- For each numeric column, we compute the median and fill missing values.
- The final DataFrame is clean, with no duplicates or missing values.
Complex Filtering Conditions
Task: From a DataFrame of product sales, filter for rows where region='US', sales > 1000, and category is not 'Discontinued'.
Code & Explanation:
df = pd.DataFrame({
'product': ['X', 'Y', 'Z', 'W'],
'region': ['US', 'EU', 'US', 'US'],
'sales': [1200, 900, 2000, 800],
'category': ['Active', 'Active', 'Discontinued', 'Active']
})
filtered = df[(df['region'] == 'US') & (df['sales'] > 1000) & (df['category'] != 'Discontinued')]
print(filtered)
Explanation:
- We chain boolean conditions with & and enclose each condition in parentheses.
- The result includes only rows meeting all criteria.
Merging Two DataFrames With Different Join Types
Task: Merge a sales DataFrame with a products DataFrame. Perform both an inner join (only matching keys) and a left join (keep all from left), joining on the product_id column.
Code & Explanation:
sales = pd.DataFrame({
'sale_id': [1, 2, 3],
'product_id': [10, 20, 30],
'quantity': [2, 5, 1]
})
products = pd.DataFrame({
'product_id': [10, 20],
'product_name': ['Widget', 'Gadget'],
'price': [9.99, 19.99]
})
# Inner join
inner_joined = pd.merge(sales, products, on='product_id', how='inner')
print("Inner Join Result:\n", inner_joined)
# Left join
left_joined = pd.merge(sales, products, on='product_id', how='left')
print("Left Join Result:\n", left_joined)
Explanation:
- how='inner' returns only matching product_ids.
- how='left' returns all rows from sales plus matched data from products, with NaN where no match exists.
Using Window Functions (Rolling Averages)
Task: Given daily sales data, compute a 7-day rolling average of sales.
Code & Explanation:
dates = pd.date_range("2021-01-01", periods=10)
df = pd.DataFrame({
'date': dates,
'sales': [100, 200, 150, 300, 250, 400, 500, 600, 700, 800]
})
df['7_day_avg'] = df['sales'].rolling(window=7).mean()
print(df)
Explanation:
- rolling(window=7).mean() calculates the moving average over a trailing 7-day window (the current day plus the previous six); the first six rows are NaN until a full window is available.
- This is akin to a window function, providing smooth trends rather than daily fluctuations.
Pivoting and Unpivoting (Melt)
Task: Convert a wide DataFrame with columns ['id', 'Q1_sales', 'Q2_sales'] to a long format with columns ['id', 'quarter', 'sales'].
Code & Explanation:
df = pd.DataFrame({
'id': [101, 102, 103],
'Q1_sales': [200, 300, 250],
'Q2_sales': [400, 500, 450]
})
long_df = df.melt(id_vars='id', var_name='quarter', value_name='sales')
print(long_df)
Explanation:
melt() reshapes data from wide to long format, making it easier to apply group-level aggregations or line plotting.
Grouping and Custom Aggregation
Task: Given transactions with store_id, category, and amount, compute the total and average sales amount per store and category.
Code & Explanation:
transactions = pd.DataFrame({
'store_id': [1, 1, 1, 2, 2, 3],
'category': ['A', 'A', 'B', 'B', 'B', 'A'],
'amount': [100, 200, 300, 400, 500, 600]
})
agg_result = transactions.groupby(['store_id', 'category']).agg(
total_sales=('amount', 'sum'),
avg_sales=('amount', 'mean')
).reset_index()
print(agg_result)
Explanation:
- groupby() followed by .agg() enables multiple aggregations in one go.
- The named-aggregation syntax (new_col_name=('old_col', 'agg_func')) defines each output column.
Date/Time Manipulations
Task: Given a dataset with a timestamp column as strings, convert it to datetime, extract the month and year, and compute monthly total sales.
Code & Explanation:
data = pd.DataFrame({
'timestamp': ['2021-01-15 10:00', '2021-01-20 12:00', '2021-02-10 14:00'],
'sales': [100, 150, 200]
})
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year_month'] = data['timestamp'].dt.to_period('M')
monthly_sales = data.groupby('year_month')['sales'].sum().reset_index()
print(monthly_sales)
Explanation:
- pd.to_datetime converts strings to datetime.
- .dt.to_period('M') extracts the year-month period.
- Grouping by this period provides monthly aggregations.
Reading Multiple CSV Files and Combining
Task: Assume you have two CSV files, data1.csv and data2.csv, with similar structures. Combine them and find total sales.
Note: For demonstration, we’ll create two DataFrames and treat them as if read from CSV.
Code & Explanation:
df1 = pd.DataFrame({'id': [1,2], 'sales': [100,200]})
df2 = pd.DataFrame({'id': [3,4], 'sales': [300,400]})
combined = pd.concat([df1, df2], ignore_index=True)
total_sales = combined['sales'].sum()
print("Combined Data:\n", combined)
print("Total Sales:", total_sales)
Explanation:
- pd.concat() combines DataFrames row-wise.
- We sum the sales column for the total.
Merging Hierarchical Data (Parent/Child)
Task: You have a hierarchical dataset of employees with employee_id and manager_id. Join the table to itself to get each employee’s manager name.
Code & Explanation:
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4],
'employee_name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'manager_id': [None, 1, 1, 2]
})
merged = employees.merge(employees, left_on='manager_id', right_on='employee_id', how='left', suffixes=('', '_mgr'))
res = merged[['employee_name', 'employee_name_mgr']]
print(res)
Explanation:
- Self-join: merge the table on manager_id = employee_id to fetch the manager’s name.
- suffixes handles column name conflicts.
Apply a Custom Function Row-wise
Task: Given a dataset with cost and tax_rate columns, compute the final price by applying a custom function to each row.
Code & Explanation:
df = pd.DataFrame({
'cost': [100, 200, 300],
'tax_rate': [0.1, 0.2, 0.15]
})
def calculate_price(row):
    return row['cost'] * (1 + row['tax_rate'])
df['final_price'] = df.apply(calculate_price, axis=1)
print(df)
Explanation:
- df.apply() with axis=1 executes the function on each row.
- The custom function computes the final price from cost and tax_rate.
Ranking and Top N Analysis
Task: Given product sales, find the top 2 products by sales for each region.
Code & Explanation:
data = pd.DataFrame({
'product': ['P1','P2','P3','P4','P5','P6'],
'region': ['US','US','US','EU','EU','EU'],
'sales': [500, 600, 300, 400, 1000, 50]
})
data['rank'] = data.groupby('region')['sales'].rank(method='dense', ascending=False)
top_2 = data[data['rank'] <= 2]
print(top_2)
Explanation:
- rank() assigns ranks within each region group.
- We filter to keep only rows with rank <= 2.
Outlier Detection and Removal
Task: Identify and remove rows that have amount values beyond 3 standard deviations from the mean.
Code & Explanation:
vals = [10,12,13,14,15,100] # 100 is an outlier
df = pd.DataFrame({'amount': vals})
mean_val = df['amount'].mean()
std_val = df['amount'].std()
threshold = 3
filtered = df[(df['amount'] > mean_val - threshold*std_val) & (df['amount'] < mean_val + threshold*std_val)]
print(filtered)
Explanation:
- We compute mean and std, define a threshold (3 sigma rule), and keep only values within that range.
Normalization and Scaling
Task: Scale the amount column to a 0-1 range (min-max normalization).
Code & Explanation:
data = pd.DataFrame({'amount':[100,200,400,800]})
amin, amax = data['amount'].min(), data['amount'].max()
data['normalized'] = (data['amount'] - amin) / (amax - amin)
print(data)
Explanation:
- Min-max scaling transforms data so the smallest value is 0 and the largest is 1, maintaining relative distances.
Creating Bins from Continuous Values
Task: Given scores, categorize them into bins: Low (0-50), Medium (51-80), High (81-100).
Code & Explanation:
scores = pd.DataFrame({'score':[10,50,75,95]})
bins = [0, 50, 80, 100]
labels = ['Low','Medium','High']
scores['category'] = pd.cut(scores['score'], bins=bins, labels=labels, include_lowest=True)
print(scores)
Explanation:
pd.cut() segments continuous data into discrete bins based on the defined intervals.
Basic Fuzzy Matching
Task: Given two lists of company names with slight spelling differences, perform a fuzzy match to find best matches.
Note: Requires the fuzzywuzzy library:
pip install fuzzywuzzy[speedup]
Code & Explanation:
from fuzzywuzzy import process
import pandas as pd
df_companies = pd.DataFrame({'company': ['Gooogle', 'Mocrosoft', 'Amazoon']})
valid_list = ['Google', 'Microsoft', 'Amazon']
def best_match(name, choices):
    match, score = process.extractOne(name, choices)
    return match if score > 80 else None
df_companies['matched'] = df_companies['company'].apply(lambda x: best_match(x, valid_list))
print(df_companies)
Explanation:
- We use process.extractOne() to find the best fuzzy match from a known list.
- Thresholding on the score ensures we only accept fairly close matches.
Calculating Month-over-Month Growth
Task: Given monthly revenue, calculate month-over-month percentage growth.
Code & Explanation:
months = pd.period_range("2021-01", periods=5, freq='M')
df = pd.DataFrame({'month': months, 'revenue': [1000,1100,1050,1200,1300]})
df['prev_revenue'] = df['revenue'].shift(1)
df['mom_growth'] = (df['revenue'] - df['prev_revenue']) / df['prev_revenue'] * 100
print(df)
Explanation:
- shift(1) retrieves the previous month’s revenue.
- The percentage change is then calculated relative to the previous month.
Correlation Matrix
Task: Compute correlation matrix for multiple numeric columns.
Code & Explanation:
df = pd.DataFrame({
'sales': [100,200,300,400],
'profit': [10,40,50,100],
'cost': [90,160,250,350]
})
corr_matrix = df.corr()
print(corr_matrix)
Explanation:
df.corr() produces a Pearson correlation matrix to identify linear relationships among numeric columns.
Semi-Join and Anti-Join Simulation
Task: From a list of employees, find those who appear in a separate “active employee” list (semi-join) and those who do not (anti-join).
Code & Explanation:
all_emps = pd.DataFrame({'emp_id':[1,2,3,4,5]})
active_emps = pd.DataFrame({'emp_id':[2,4]})
# Semi-join: employees who are in active_emps
semi = all_emps[all_emps['emp_id'].isin(active_emps['emp_id'])]
# Anti-join: employees not in active_emps
anti = all_emps[~all_emps['emp_id'].isin(active_emps['emp_id'])]
print("Semi-join:\n", semi)
print("Anti-join:\n", anti)
Explanation:
- isin() filters rows by membership.
- Semi-join: the intersection (employees also in active_emps).
- Anti-join: the complement (employees not in active_emps).
Memory Optimization with Categoricals
Task: Convert string columns to categorical to reduce memory usage.
Code & Explanation:
df = pd.DataFrame({
'country': ['USA','USA','France','Germany','Germany','Germany'],
'city': ['NY','LA','Paris','Berlin','Munich','Berlin']
})
df['country'] = df['country'].astype('category')
df['city'] = df['city'].astype('category')
print(df.info())
Explanation:
- Converting to categorical reduces memory usage for repeated strings.
Extracting Data from JSON Columns
Task: Given a JSON column, extract a specific field into a new column.
Code & Explanation:
import json
df = pd.DataFrame({
'id':[1,2],
'info':['{"name":"Alice","age":30}', '{"name":"Bob","age":25}']
})
df['info'] = df['info'].apply(json.loads)
df['name'] = df['info'].apply(lambda x: x['name'])
print(df)
Explanation:
- Convert JSON strings to dicts with json.loads, then extract the name field.
Basic Linear Regression Using statsmodels
Task: Perform a simple linear regression of sales on marketing_spend.
Code & Explanation:
import statsmodels.api as sm
import numpy as np
df = pd.DataFrame({
'marketing_spend': [100,200,300,400,500],
'sales': [10,25,35,50,65]
})
X = df[['marketing_spend']]
X = sm.add_constant(X) # adds intercept
y = df['sales']
model = sm.OLS(y, X).fit()
print(model.summary())
Explanation:
- Using statsmodels, we build a linear model: sales ~ marketing_spend.
- model.summary() shows regression coefficients, p-values, and R-squared.
Using pivot_table for Complex Aggregations
Task: Create a pivot table of average amount by region and category.
Code & Explanation:
data = pd.DataFrame({
'region':['US','US','EU','EU'],
'category':['A','B','A','B'],
'amount':[100,200,150,300]
})
pivot = pd.pivot_table(data, index='region', columns='category', values='amount', aggfunc='mean')
print(pivot)
Explanation:
pivot_table() allows complex multi-dimensional aggregations in a single call.
Identifying and Removing Anomalies Using Z-score
Task: Remove rows where the value column has an absolute z-score greater than 3.
Code & Explanation:
from scipy.stats import zscore
data = pd.DataFrame({'value':[10,11,10,12,300]})
data['zscore'] = zscore(data['value'])
cleaned = data[abs(data['zscore']) <= 3]
print(cleaned)
Explanation:
zscore() computes standard scores; values beyond ±3 are considered anomalies.
Regex Filtering
Task: Filter rows where customer_name starts with “J” and ends with “n”.
Code & Explanation:
df = pd.DataFrame({'customer_name':['John','Jason','Alice','Jack','Jill','Jen']})
filtered = df[df['customer_name'].str.match(r'^J.*n$')]
print(filtered)
Explanation:
str.match() with the regex ^J.*n$ ensures names start with J and end with n.
Summary Statistics by Groups and Subgroups
Task: Compute the count, mean, and median of score grouped by class and gender.
Code & Explanation:
df = pd.DataFrame({
'class':['A','A','B','B','B'],
'gender':['M','F','M','F','M'],
'score':[80,90,70,88,92]
})
stats = df.groupby(['class','gender'])['score'].agg(['count','mean','median']).reset_index()
print(stats)
Explanation:
- Multiple aggregations provide a broad statistical summary per subgroup.
Cumulative Sums and Rolling Averages
Task: Compute cumulative sum and 3-value rolling average of a numeric column.
Code & Explanation:
df = pd.DataFrame({'val':[10,20,30,40,50]})
df['cumulative_sum'] = df['val'].cumsum()
df['rolling_avg_3'] = df['val'].rolling(3).mean()
print(df)
Explanation:
- cumsum() provides a running total.
- rolling(3).mean() gives a trailing 3-value average.
Working with Multi-Index DataFrames
Task: Create a multi-index DataFrame and select a subset using .loc[].
Code & Explanation:
arrays = [
['A','A','B','B'],
['X','Y','X','Y']
]
index = pd.MultiIndex.from_arrays(arrays, names=('letter','subgroup'))
df = pd.DataFrame({'value':[10,20,30,40]}, index=index)
subset = df.loc[('A', 'X')]
print(df)
print("Subset:\n", subset)
Explanation:
- MultiIndex allows hierarchical indexing.
- .loc[('A','X')] selects the data for that specific subgroup.
Sorting by Multiple Columns and Ranking
Task: Sort by region, then by sales descending, and rank the results within each region.
Code & Explanation:
df = pd.DataFrame({
'region':['US','US','EU','EU','EU'],
'sales':[100,250,300,200,100]
})
df_sorted = df.sort_values(['region','sales'], ascending=[True,False])
df_sorted['rank_in_region'] = df_sorted.groupby('region')['sales'].rank(method='first', ascending=False)
print(df_sorted)
Explanation:
- Sort by multiple keys and then rank within each group.
Combining Numeric and Categorical Operations (Pivot + Calculation)
Task: Pivot by region and category to get total sales, then add a column showing the percentage each category contributes to the region’s total.
Code & Explanation:
data = pd.DataFrame({
'region':['US','US','EU','EU'],
'category':['A','B','A','B'],
'sales':[100,300,200,400]
})
pivot = data.pivot_table(index='region', columns='category', values='sales', aggfunc='sum', fill_value=0)
pivot['total'] = pivot.sum(axis=1)
pivot['A_pct'] = pivot['A'] / pivot['total'] * 100
pivot['B_pct'] = pivot['B'] / pivot['total'] * 100
print(pivot)
Explanation:
- After pivoting, we compute totals and percentages per row.
Exporting Results to CSV and Excel
Task: After analysis, export the final DataFrame to both CSV and Excel formats.
Code & Explanation:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6]})
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
print("Data exported to CSV and Excel.")
Explanation:
- to_csv() and to_excel() save the DataFrame to CSV and Excel files, respectively.
- For Excel, you may specify additional formatting or sheets if desired.
All of these tasks represent practical challenges senior Data Analysts may need to solve efficiently. The solutions demonstrate proficiency in data manipulation, cleaning, transformation, aggregation, and exporting results in Python with pandas.
Popular Data Analysis questions
What role does Statistical Analysis play in Data Analysis?
Statistical analysis is central to data analysis: it supplies the methods for interpreting data and making sense of it. It helps the Analyst identify trends, correlations, and patterns, and assess whether those findings are significant.
Through hypothesis testing, regression, and probability distributions, statistical analysis allows the Analyst to make objective, data-driven decisions and to surface insights that inform further decisions.
How do Data Analysts ensure the accuracy and reliability of their findings?
Data Analysts verify their data sources, apply consistent methodologies, and run thorough checks at every stage of the analysis to ensure results are accurate and reliable. They cross-check findings with different methods, test their statistical techniques for errors or bias, and use tools that support dependable results.
They also review and update their models periodically to keep them accurate, and document all processes so that results are transparent and reproducible.
What are the 5 levels of Data Analysis?
The five levels of data analysis are Descriptive, Diagnostic, Predictive, Prescriptive, and Cognitive. Descriptive analysis summarizes what has happened, Diagnostic explains why it happened, Predictive projects future trends, and Prescriptive recommends actions based on those predictions. Cognitive analysis uses AI to imitate human decision-making and continuously learns from the data.
What does a Data Analyst do?
A Data Analyst collects, cleans, and interprets data to help organizations make informed decisions. They analyze datasets for trends, patterns, and insights, and turn their findings into visualizations and reports for communication. Along the way, they work with teams on business strategy or on solving specific problems at hand.
What are the 4 basic steps in Data Preparation and Data Analysis?
The four basic steps are collecting the data, cleaning it (fixing errors and handling missing values), transforming it into suitable formats, and analyzing it with statistical methods to draw inferences. Together, these steps ensure the data is accurate and ready to be used reliably and meaningfully.
What are the challenges in migrating data to new systems?
The main challenges are ensuring the data is compatible with the new system, keeping it intact, and minimizing downtime. Migrations therefore require careful planning of how data moves into and out of the new system, followed by validation that everything transferred correctly. Handling large volumes of data demands meticulous attention to the risk of loss or corruption.
How do Data Engineers optimize database performance?
Data Engineers make databases run more efficiently by applying indexing strategies, optimizing query performance, and managing resources such as CPU and memory. They use partitioning to distribute data across storage and reduce query times. They also continuously monitor performance metrics, adjust configuration, and tune queries where needed to keep speed and efficiency high.
What tools and technologies are essential for Data Engineering?
The most in-demand tools and technologies include ETL tools (e.g., Apache NiFi, Talend), data processing frameworks (Apache Spark, Apache Hadoop), and cloud platforms (AWS, Google Cloud, Azure). PostgreSQL and MySQL are common relational database management systems, MongoDB is a common NoSQL database, and Python and SQL are used for scripting and querying data.
How do Data Engineers design and manage data pipelines?
Data Engineers design and manage data pipelines by first mapping how data will flow from its sources to its destination, making sure it is processed correctly along the way.
They use ETL (extract, transform, load) processes to pull data from diverse sources, transform it into a usable format, and deliver it into storage systems. They also monitor pipeline performance, automate routine tasks, and build in error handling to keep operations running smoothly.
What is the difference between a Data Engineer and a Data Analyst?
A Data Engineer designs the architecture needed for data collection, processing, and storage, including data pipelines and databases. A Data Analyst interprets the data the Engineers make available, deriving meaningful insights, writing reports, and supporting decisions. Where Data Engineers ensure data is available and reliable, Data Analysts turn that data into actionable insight for the business.
Do Data Engineers use SQL?
Yes, Data Engineers make extensive use of SQL. It is one of their core tools for querying, manipulating, and managing data stored in relational databases.
Typical SQL work includes designing and optimizing database structures, extracting and transforming data, building data pipelines, and storing and retrieving data efficiently. Integrating large volumes of data from different sources into data warehouses or other storage relies heavily on SQL.