Level Up Your Data Game: 15 Pandas Functions Data Scientists Need to Master for Interviews [Part 2]
From aggregations and joins to time series transformations and feature engineering, this guide covers the must-know techniques to ace your next data science interview! 🐼📊
👋 Hey! This is Manisha Arora from PrepVector. Welcome to the Tech Growth Series, a newsletter that aims to bridge the gap between academic knowledge and the practical aspects of data science. My goal is to simplify complicated data concepts, share my perspectives on the latest trends, and pass on my learnings from building and leading data teams.
Having mastered the fundamentals of data ingestion, cleaning, and basic selection, we're now ready to unlock the true power of Pandas for more complex data manipulation. This installment covers advanced techniques like combining disparate datasets, summarizing information from groups, reshaping data for specific analyses, and, most importantly, the art of feature engineering: creating new, impactful features from your existing data. These are the skills that transform raw numbers into valuable signals for your AI/ML models.
📊 Data Aggregation & Summarization (Making Sense of Groups)
This is where you begin to derive higher-level insights from your raw data. By grouping data on common characteristics and then summarizing each group, you can identify trends, patterns, and key performance indicators.
🔹 .groupby() / .agg() – Aggregation and Summarization
These are the workhorses for summarizing data by categories.
df.groupby('column_name') # Groups the DataFrame by one or more columns.
.agg({'col_to_agg': 'function'}) # Applies aggregation functions (e.g., 'sum', 'mean', 'count', 'min', 'max') to specified columns.
Use Cases:
Calculating total sales per product.
Finding average prices by city.
Counting unique users per region.
Example:
import pandas as pd
data = {
'order_id': [101, 102, 103, 104, 105, 106, 107, 108],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Keyboard'],
'quantity': [1, 2, 1, 1, 1, 3, 2, 1],
'price': [1200, 25, 75, 1200, 300, 25, 1200, 75],
'city': ['Bangalore', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08']
}
sales_df = pd.DataFrame(data)
# Total quantity sold per product
product_sales = sales_df.groupby('product')['quantity'].sum()
# Multiple aggregations by city
city_stats = sales_df.groupby('city').agg(
total_price=('price', 'sum'),
avg_quantity=('quantity', 'mean'),
num_transactions=('product', 'count')
)
Explanation: groupby followed by agg is one of the most common patterns in data analysis. It lets you split data into groups, apply a function to each group, and combine the results into a meaningful summary.
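If the interviewer pushes further, it helps to know a couple of variations. A quick sketch, reusing sales_df from above (the variable names here are just for illustration):
# Several statistics for one column in a single pass
price_summary = sales_df.groupby('product')['price'].agg(['sum', 'mean', 'count'])
# as_index=False keeps the group keys as a regular column,
# which is convenient before merging or plotting
product_totals = sales_df.groupby('product', as_index=False)['quantity'].sum()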
🔗 .merge() / .join() / pd.concat() – Combining DataFrames
Real-world data often lives in multiple files or tables. These functions enable you to combine different DataFrames based on common keys or by stacking them together.
pd.merge(df1, df2, on='key_col', how='inner/left/right/outer') # Like SQL joins
df1.join(df2, on='key_col', how='left') # Joins df1's column (or index) against df2's index
pd.concat([df1, df2], axis=0/1) # Stack DataFrames
Use Cases:
Joining customer data with order data.
Combining quarterly sales reports into a single annual report.
Example:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 90, 78]})
merged_df = pd.merge(df1, df2, on='id', how='inner')
df_concat1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_concat2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated_df = pd.concat([df_concat1, df_concat2], axis=0)
Explanation: merge is your go-to for relational joins. join is a convenience method for index-based joins. concat is for stacking or appending DataFrames directly.
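One variation worth knowing for interviews: how= controls which keys survive the join. A small sketch building on df1 and df2 above (indicator=True is a standard pd.merge option; the variable names are illustrative):
# An outer join keeps ids from both sides; unmatched cells become NaN.
# indicator=True adds a '_merge' column showing each row's origin
outer_df = pd.merge(df1, df2, on='id', how='outer', indicator=True)
# ignore_index=True renumbers rows after stacking, avoiding duplicate labels
stacked_df = pd.concat([df_concat1, df_concat2], axis=0, ignore_index=True)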
🔄 Data Transformation & Reshaping (Molding Data)
This category includes functions that allow you to apply custom logic to your data, change its format from long to wide, or perform element-wise transformations.
🔧 .apply() / .map() / .applymap() – Element-wise and Row/Column-wise Operations
Use Cases:
Feature engineering
Data transformation (unit conversion, string cleaning)
Normalizing or standardizing values
Example:
grades_df = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie'],
'score': [85, 92, 78],
'grade_letter': ['B', 'A', 'C']
})
grades_df['score_normalized'] = grades_df['score'].apply(lambda x: x / 100)
grade_mapping = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
grades_df['grade_description'] = grades_df['grade_letter'].map(grade_mapping)
grades_df_num_percent = grades_df[['score']].apply(lambda col: col / 100, axis=0)
Explanation:
apply: versatile and flexible
map: great for dictionary or lookup-based transformations
applymap: element-wise across a whole DataFrame (less common)
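To make the distinction concrete, here is a minimal sketch reusing grades_df from above (the new column names are illustrative):
# apply with axis=1 passes each row as a Series - handy for multi-column logic
grades_df['summary'] = grades_df.apply(
    lambda row: f"{row['student']}: {row['score']}", axis=1
)
# applymap works element-wise on a whole DataFrame
# (in pandas >= 2.1, DataFrame.map is the preferred spelling)
scaled = grades_df[['score']].applymap(lambda x: x / 100)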
📈 .sort_values() – Sorting Data
df.sort_values(by='column', ascending=True/False)
Use Cases:
Top-N analysis
Chronological ordering
Example:
sorted_by_price = sales_df.sort_values(by='price', ascending=False)
sorted_by_product_then_price = sales_df.sort_values(by=['product', 'price'], ascending=[True, False])
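For top-N questions, sorting plus head() is the classic pattern; nlargest() is an equivalent shortcut for numeric columns. A quick sketch using sales_df from above:
# Top 3 most expensive transactions
top3 = sales_df.sort_values(by='price', ascending=False).head(3)
# Equivalent shortcut for numeric columns
top3_alt = sales_df.nlargest(3, 'price')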
🔁 .pivot_table() – Reshaping Data for Analysis
pd.pivot_table(df, values='value_col', index='row_col', columns='col_col', aggfunc='mean')
Use Cases:
Multi-dimensional summaries
Creating clean reporting tables
Example:
pivot_table_df = pd.pivot_table(
sales_df,
values='quantity',
index='city',
columns='product',
aggfunc='sum'
)
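Two pivot_table options come up often in practice: fill_value for missing combinations and margins for grand totals. A sketch extending the example above:
# fill_value replaces NaN for city/product pairs with no sales;
# margins=True appends an 'All' row and column with totals
pivot_with_totals = pd.pivot_table(
    sales_df,
    values='quantity',
    index='city',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True,
)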
🧪 Feature Engineering & Time Series Specific Operations (Creating Value)
These functions are critical when crafting new variables for your ML models, especially with temporal data.
🧱 .cut() / .qcut() – Binning Continuous Data
pd.cut(series, bins=[...], labels=[...])
pd.qcut(series, q=4, labels=[...])
Use Cases:
Categorizing numeric features
Quantile-based bucketing
Example:
scores = pd.Series([65, 72, 88, 95, 50, 78, 81, 99])
score_bins = pd.cut(scores, bins=[0, 70, 85, 100], labels=['Fail', 'Pass', 'Excellent'])
score_quantiles = pd.qcut(scores, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
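A quick way to sanity-check your bins is value_counts(). Continuing the example above:
# cut fixes the bin edges, so counts per bucket can be uneven
print(score_bins.value_counts())
# qcut targets (roughly) equal-sized buckets instead
print(score_quantiles.value_counts())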
🧬 .astype() – Changing Data Types
series.astype(dtype)
df.astype({'col': 'dtype'})
Use Cases:
Converting strings to numbers
Optimizing memory usage
Enabling categorical operations
Example:
df_types = pd.DataFrame({
'id': ['1', '2', '3'],
'value': [10.5, 20.0, 30.1],
'is_active': [True, False, True]
})
df_types['id'] = df_types['id'].astype(int)
df_types['value'] = df_types['value'].astype('float32')
df_types['is_active'] = df_types['is_active'].astype('category')
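It's worth verifying that the conversion actually happened, and knowing the safer alternative when a column may contain bad values. A short sketch (pd.to_numeric with errors='coerce' is standard pandas; the 'oops' value is just an illustration):
# Confirm the new dtypes and inspect memory usage
print(df_types.dtypes)
print(df_types.memory_usage(deep=True))
# astype raises on unparseable values; to_numeric can coerce them to NaN
messy = pd.Series(['1', '2', 'oops'])
clean = pd.to_numeric(messy, errors='coerce')  # -> 1.0, 2.0, NaN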
⏱ .resample() – Time Series Resampling
df.resample('W').sum()
df.resample('3D').mean()
Use Cases:
Daily to weekly aggregates
Trend analysis
Temporal smoothing
Example:
dates = pd.date_range('2024-01-01', periods=14, freq='D')
values = range(1, 15)
ts_df = pd.DataFrame({'value': values}, index=dates)
weekly_sum = ts_df.resample('W').sum()
three_day_mean = ts_df.resample('3D').mean()
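Note that resample() requires a DatetimeIndex (or a datetime column passed via on=). Upsampling works too; a minimal sketch (frequency alias casing, e.g. '12h' vs '12H', varies slightly across pandas versions):
# Upsampling from daily to 12-hourly creates new timestamps;
# ffill() carries the last known value forward into them
upsampled = ts_df.resample('12h').ffill()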
🏅 .rank() – Ranking Values
series.rank(method='average')
Use Cases:
Leaderboards
Percentile features
Example:
df_rank = pd.DataFrame({'score': [85, 92, 78, 92, 60]})
df_rank['rank_avg'] = df_rank['score'].rank(method='average')
df_rank['rank_dense'] = df_rank['score'].rank(method='dense')
df_rank['rank_max'] = df_rank['score'].rank(method='max')
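A related trick for feature engineering: pct=True rescales ranks into (0, 1], giving percentile ranks directly. Continuing with df_rank from above:
# Percentile rank of each score - a scale-free feature for ML models
df_rank['percentile'] = df_rank['score'].rank(pct=True)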
⏩ .shift() / .diff() / .rolling() – Time Series Feature Engineering
Use Cases:
Lag/lead features
Rate of change
Moving averages
Example:
ts_df['value_lag_1'] = ts_df['value'].shift(1)
ts_df['value_diff'] = ts_df['value'].diff(1)
ts_df['rolling_mean_3d'] = ts_df['value'].rolling(window=3).mean()
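Two close cousins worth having ready in an interview: pct_change() for relative change and a rolling standard deviation as a simple volatility feature. A sketch continuing with ts_df (the first window-1 rows of any rolling statistic are NaN until the window fills):
# Relative change between consecutive rows
ts_df['pct_change'] = ts_df['value'].pct_change()
# Rolling volatility over a 3-row window; the first 2 rows are NaN
ts_df['rolling_std_3d'] = ts_df['value'].rolling(window=3).std()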
✅ Conclusion
Mastering these Pandas functions will significantly boost your confidence and performance in data science and machine learning interviews. They cover the breadth of common data manipulation tasks, from loading and cleaning to transforming and aggregating. Practice them, understand their nuances, and you'll be well on your way to becoming a Pandas pro! 🐼
Happy data wrangling! 💪📊
Upcoming Courses:
Master Product Sense and A/B Testing, and learn to use statistical methods to drive product growth. The course focuses on building a problem-solving mindset and applying data-driven strategies, including A/B testing, ML, and causal inference.
AI/ML Projects for Data Professionals
Gain hands-on experience and build a portfolio of industry AI/ML projects. Scope ML projects, get stakeholder buy-in, and execute the workflow from data exploration to model deployment. You will learn to use coding best practices to deliver end-to-end AI/ML projects you can showcase to employers or clients.
Not sure which course aligns with your goals? Send me a message on LinkedIn with your background and aspirations, and I'll help you find the best fit for your journey.