Level Up Your Data Game: 15 Pandas Functions Data Scientists Need to Master for Interviews [Part 2]
From aggregations and joins to time series transformations and feature engineering, this guide covers the must-know techniques to ace your next data science interview! 🐼📊
👋 Hey! This is Manisha Arora from PrepVector. Welcome to the Tech Growth Series, a newsletter that aims to bridge the gap between academic knowledge and the practical aspects of data science. My goal is to simplify complicated data concepts, share my perspectives on the latest trends, and pass on my learnings from building and leading data teams.
Having mastered the fundamentals of data ingestion, cleaning, and basic selection, we're now ready to unlock the true power of Pandas for more complex data manipulation. This installment covers advanced techniques like combining disparate datasets, summarizing information from groups, reshaping data for specific analyses, and, most importantly, the art of feature engineering: creating new, impactful features from your existing data. These are the skills that transform raw numbers into valuable signals for your AI/ML models.
📊 Data Aggregation & Summarization (Making Sense of Groups)
This is where you begin to derive higher-level insights from your raw data. By grouping data on common characteristics and then summarizing each group, you can identify trends, patterns, and key performance indicators.
🔹 .groupby() / .agg() – Aggregation and Summarization
These are the workhorses for summarizing data by categories.
df.groupby('column_name') # Groups the DataFrame by one or more columns.
.agg({'col_to_agg': 'function'}) # Applies aggregation functions (e.g., 'sum', 'mean', 'count', 'min', 'max') to specified columns.
Use Cases:
Calculating total sales per product.
Finding average prices by city.
Counting unique users per region.
Example:
import pandas as pd
data = {
'order_id': [101, 102, 103, 104, 105, 106, 107, 108],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Keyboard'],
'quantity': [1, 2, 1, 1, 1, 3, 2, 1],
'price': [1200, 25, 75, 1200, 300, 25, 1200, 75],
'city': ['Bangalore', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08']
}
sales_df = pd.DataFrame(data)
# Total quantity sold per product
product_sales = sales_df.groupby('product')['quantity'].sum()
# Multiple aggregations by city
city_stats = sales_df.groupby('city').agg(
total_price=('price', 'sum'),
avg_quantity=('quantity', 'mean'),
num_transactions=('product', 'count')
)
Explanation: groupby followed by agg is one of the most common patterns in data analysis. It lets you split data into groups, apply a function to each group, and combine the results into a meaningful summary.
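If the interviewer pushes further, it helps to know a couple of variations. A quick sketch, reusing sales_df from above (the variable names here are just for illustration):
# Several statistics for one column in a single pass
price_summary = sales_df.groupby('product')['price'].agg(['sum', 'mean', 'count'])
# as_index=False keeps the group keys as a regular column,
# which is convenient before merging or plotting
product_totals = sales_df.groupby('product', as_index=False)['quantity'].sum()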
🔗 .merge() / .join() / pd.concat() – Combining DataFrames
Real-world data often lives in multiple files or tables. These functions enable you to combine different DataFrames based on common keys or by stacking them together.
pd.merge(df1, df2, on='key_col', how='inner/left/right/outer') # Like SQL joins
df1.join(df2, on='key_col', how='left') # Joins df1's column (or index) against df2's index
pd.concat([df1, df2], axis=0/1) # Stack DataFrames
Use Cases:
Joining customer data with order data.
Combining quarterly sales reports into a single annual report.
Example:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 90, 78]})
merged_df = pd.merge(df1, df2, on='id', how='inner')
df_concat1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_concat2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concatenated_df = pd.concat([df_concat1, df_concat2], axis=0)
Explanation: merge is your go-to for relational joins. join is a convenience method for index-based joins. concat is for stacking or appending DataFrames directly.
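One variation worth knowing for interviews: how= controls which keys survive the join. A small sketch building on df1 and df2 above (indicator=True is a standard pd.merge option; the variable names are illustrative):
# An outer join keeps ids from both sides; unmatched cells become NaN.
# indicator=True adds a '_merge' column showing each row's origin
outer_df = pd.merge(df1, df2, on='id', how='outer', indicator=True)
# ignore_index=True renumbers rows after stacking, avoiding duplicate labels
stacked_df = pd.concat([df_concat1, df_concat2], axis=0, ignore_index=True)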
🔄 Data Transformation & Reshaping (Molding Data)
This category includes functions that allow you to apply custom logic to your data, change its format from long to wide, or perform element-wise transformations.
🔧 .apply() / .map() / .applymap() – Element-wise and Row/Column-wise Operations
Use Cases:
Feature engineering
Data transformation (unit conversion, string cleaning)
Normalizing or standardizing values
Example:
grades_df = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie'],
'score': [85, 92, 78],
'grade_letter': ['B', 'A', 'C']
})
grades_df['score_normalized'] = grades_df['score'].apply(lambda x: x / 100)
grade_mapping = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
grades_df['grade_description'] = grades_df['grade_letter'].map(grade_mapping)
grades_df_num_percent = grades_df[['score']].apply(lambda col: col / 100, axis=0)
Explanation:
apply: versatile and flexible
map: great for dictionary or lookup-based transformations
applymap: element-wise across a whole DataFrame (less common)
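To make the distinction concrete, here is a minimal sketch reusing grades_df from above (the new column names are illustrative):
# apply with axis=1 passes each row as a Series - handy for multi-column logic
grades_df['summary'] = grades_df.apply(
    lambda row: f"{row['student']}: {row['score']}", axis=1
)
# applymap works element-wise on a whole DataFrame
# (in pandas >= 2.1, DataFrame.map is the preferred spelling)
scaled = grades_df[['score']].applymap(lambda x: x / 100)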
📈 .sort_values() – Sorting Data
df.sort_values(by='column', ascending=True/False)
Use Cases:
Top-N analysis
Chronological ordering
Example:
sorted_by_price = sales_df.sort_values(by='price', ascending=False)
sorted_by_product_then_price = sales_df.sort_values(by=['product', 'price'], ascending=[True, False])
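For top-N questions, sorting plus head() is the classic pattern; nlargest() is an equivalent shortcut for numeric columns. A quick sketch using sales_df from above:
# Top 3 most expensive transactions
top3 = sales_df.sort_values(by='price', ascending=False).head(3)
# Equivalent shortcut for numeric columns
top3_alt = sales_df.nlargest(3, 'price')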
🔁 .pivot_table() – Reshaping Data for Analysis
pd.pivot_table(df, values='value_col', index='row_col', columns='col_col', aggfunc='mean')
Use Cases:
Multi-dimensional summaries
Creating clean reporting tables
Example:
pivot_table_df = pd.pivot_table(
sales_df,
values='quantity',
index='city',
columns='product',
aggfunc='sum'
)
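Two pivot_table options come up often in practice: fill_value for missing combinations and margins for grand totals. A sketch extending the example above:
# fill_value replaces NaN for city/product pairs with no sales;
# margins=True appends an 'All' row and column with totals
pivot_with_totals = pd.pivot_table(
    sales_df,
    values='quantity',
    index='city',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True,
)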
🧪 Feature Engineering & Time Series Specific Operations (Creating Value)
These functions are critical when crafting new variables for your ML models, especially with temporal data.
🧱 .cut() / .qcut() – Binning Continuous Data
pd.cut(series, bins=[...], labels=[...])
pd.qcut(series, q=4, labels=[...])
Use Cases:
Categorizing numeric features
Quantile-based bucketing
Example:
scores = pd.Series([65, 72, 88, 95, 50, 78, 81, 99])
score_bins = pd.cut(scores, bins=[0, 70, 85, 100], labels=['Fail', 'Pass', 'Excellent'])
score_quantiles = pd.qcut(scores, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
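A quick way to sanity-check your bins is value_counts(). Continuing the example above:
# cut fixes the bin edges, so counts per bucket can be uneven
print(score_bins.value_counts())
# qcut targets (roughly) equal-sized buckets instead
print(score_quantiles.value_counts())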
🧬 .astype() – Changing Data Types
series.astype(dtype)
df.astype({'col': 'dtype'})
Use Cases:
Converting strings to numbers
Optimizing memory usage
Enabling categorical operations
Example:
df_types = pd.DataFrame({
'id': ['1', '2', '3'],
'value': [10.5, 20.0, 30.1],
'is_active': [True, False, True]
})
df_types['id'] = df_types['id'].astype(int)
df_types['value'] = df_types['value'].astype('float32')
df_types['is_active'] = df_types['is_active'].astype('category')
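It's worth verifying that the conversion actually happened, and knowing the safer alternative when a column may contain bad values. A short sketch (pd.to_numeric with errors='coerce' is standard pandas; the 'oops' value is just an illustration):
# Confirm the new dtypes and inspect memory usage
print(df_types.dtypes)
print(df_types.memory_usage(deep=True))
# astype raises on unparseable values; to_numeric can coerce them to NaN
messy = pd.Series(['1', '2', 'oops'])
clean = pd.to_numeric(messy, errors='coerce')  # -> 1.0, 2.0, NaN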
⏱ .resample() – Time Series Resampling
df.resample('W').sum()
df.resample('3D').mean()
Use Cases:
Daily to weekly aggregates
Trend analysis
Temporal smoothing
Example:
dates = pd.date_range('2024-01-01', periods=14, freq='D')
values = range(1, 15)
ts_df = pd.DataFrame({'value': values}, index=dates)
weekly_sum = ts_df.resample('W').sum()
three_day_mean = ts_df.resample('3D').mean()
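Note that resample() requires a DatetimeIndex (or a datetime column passed via on=). Upsampling works too; a minimal sketch (frequency alias casing, e.g. '12h' vs '12H', varies slightly across pandas versions):
# Upsampling from daily to 12-hourly creates new timestamps;
# ffill() carries the last known value forward into them
upsampled = ts_df.resample('12h').ffill()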
🏅 .rank() – Ranking Values
series.rank(method='average')
Use Cases:
Leaderboards
Percentile features
Example:
df_rank = pd.DataFrame({'score': [85, 92, 78, 92, 60]})
df_rank['rank_avg'] = df_rank['score'].rank(method='average')
df_rank['rank_dense'] = df_rank['score'].rank(method='dense')
df_rank['rank_max'] = df_rank['score'].rank(method='max')
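A related trick for feature engineering: pct=True rescales ranks into (0, 1], giving percentile ranks directly. Continuing with df_rank from above:
# Percentile rank of each score - a scale-free feature for ML models
df_rank['percentile'] = df_rank['score'].rank(pct=True)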
⏩ .shift() / .diff() / .rolling() – Time Series Feature Engineering
Use Cases:
Lag/lead features
Rate of change
Moving averages
Example:
ts_df['value_lag_1'] = ts_df['value'].shift(1)
ts_df['value_diff'] = ts_df['value'].diff(1)
ts_df['rolling_mean_3d'] = ts_df['value'].rolling(window=3).mean()
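Two close cousins worth having ready in an interview: pct_change() for relative change and a rolling standard deviation as a simple volatility feature. A sketch continuing with ts_df (the first window-1 rows of any rolling statistic are NaN until the window fills):
# Relative change between consecutive rows
ts_df['pct_change'] = ts_df['value'].pct_change()
# Rolling volatility over a 3-row window; the first 2 rows are NaN
ts_df['rolling_std_3d'] = ts_df['value'].rolling(window=3).std()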
✅ Conclusion
Mastering these Pandas functions will significantly boost your confidence and performance in data science and machine learning interviews. They cover the breadth of common data manipulation tasks, from loading and cleaning to transforming and aggregating. Practice them, understand their nuances, and you'll be well on your way to becoming a Pandas pro! 🐼
Happy data wrangling! 💪📊
Upcoming Courses:
Master Product Sense and A/B Testing, and learn to use statistical methods to drive product growth. The course focuses on building a problem-solving mindset and applying data-driven strategies, including A/B testing, ML, and causal inference.
AI/ML Projects for Data Professionals
Gain hands-on experience and build a portfolio of industry AI/ML projects. Scope ML projects, get stakeholder buy-in, and execute the workflow from data exploration to model deployment. You will learn to use coding best practices to deliver end-to-end AI/ML projects you can showcase to employers or clients.
Not sure which course aligns with your goals? Send me a message on LinkedIn with your background and aspirations, and I'll help you find the best fit for your journey.