𝗕𝘂𝗶𝗹𝗱 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹 𝘄𝗶𝘁𝗵 𝗖𝗵𝗮𝘁𝗚𝗣𝗧: 𝗜𝗱𝗲𝗻𝘁𝗶𝗳𝘆 𝗵𝗶𝗴𝗵 𝗽𝗿𝗼𝗽𝗲𝗻𝘀𝗶𝘁𝘆 𝗹𝗲𝗮𝗱𝘀 𝗳𝗼𝗿 𝗰𝗮𝗺𝗽𝗮𝗶𝗴𝗻𝘀

𝙃𝙞𝙜𝙝 𝙄𝙢𝙥𝙖𝙘𝙩 𝙊𝙥𝙥𝙤𝙧𝙩𝙪𝙣𝙞𝙩𝙮 𝙛𝙤𝙧: 𝘼𝙄 𝘼𝙪𝙩𝙤𝙢𝙖𝙩𝙞𝙤𝙣 𝘼𝙜𝙚𝙣𝙘𝙞𝙚𝙨, 𝙈𝙖𝙧𝙠𝙚𝙩𝙞𝙣𝙜 𝙖𝙜𝙚𝙣𝙘𝙞𝙚𝙨, 𝙇𝙚𝙖𝙙 𝙂𝙚𝙣 𝙖𝙜𝙚𝙣𝙘𝙞𝙚𝙨, 𝙎𝙩𝙖𝙧𝙩𝙪𝙥𝙨, 𝙈𝙞𝙘𝙧𝙤-𝙎𝙢𝙖𝙡𝙡-𝙈𝙚𝙙𝙞𝙪𝙢 𝙀𝙣𝙩𝙚𝙧𝙥𝙧𝙞𝙨𝙚𝙨 (𝙈𝙎𝙈𝙀)

I was working on an analysis project involving a model build, using GPT and Bard as coding co-pilots, and started to wonder whether GPT (GPT Plus) could handle a full model build from prompts and instructions alone. Amazingly, yes, but with some caveats and constraints. Check out the quick, concise video below to see how it works; there is an in-depth video on my YouTube channel [https://lnkd.in/g579dyMM].

𝗣𝗿𝗼𝗺𝗽𝘁𝘀
Shared in the comments. They will vary from case to case, so customize as necessary. They are also in the YouTube description, which is probably easier to copy from.

𝗨𝘀𝗲 𝗖𝗮𝘀𝗲𝘀
1. Small datasets, low-complexity models: build end-to-end with GPT.
2. Large datasets, complex models: share a small sample, get the code, run it on your own platform, and iterate with GPT using the results and code.
3. Data engineering (modeling dataset): this is the biggest piece of the model-build pipeline. Share sample data for cleaning, run the code on your platform, and iterate.

𝗧𝗶𝗽𝘀 & 𝗧𝗿𝗶𝗰𝗸𝘀
1. Know GPT's limits: it crashes on high-complexity models and larger datasets. Experiment with data and models to gauge the threshold.
2. Start with low complexity: calibrate hyperparameters slowly if the model is not robust, e.g., start a random forest with just 30 trees and a depth of only 3.
3. Check assumptions and review the work: e.g., it once dropped 30% of my population as outliers.
4. It tends to overfit models: give specific instructions and keep an eye out.
5. Model metrics: it can share a confusion matrix, precision/recall/accuracy, and others. Request the one you need.
6. Explanatory variables: some, like feature importance, are easy for GPT, but it tends to crash on others, like partial dependence plots. Get the code and run it yourself; use a Google Colab T4 GPU for intensive tasks (it has free limits).
7. Decile table: it tends to sort the table in reverse order; keep an eye out.
8. Timing: runs faster in off-hours (US). I have seen a 3-5x difference.

𝗗𝗮𝘁𝗮 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆
1. PII data: anonymize or drop it.
2. Uploaded file security: use sample data or scrambled data.
3. Uploaded files are easily hacked on GPT Store GPTs; see my LinkedIn post for more information on hacking and countermeasures [https://lnkd.in/gwe6zg5g]. I have not yet heard of uploaded files from user conversations being hacked, but it is an evolving area, so be mindful.

𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀
On a live project, data engineering and creating a modeling dataset account for ~80% of the model-build effort, and implementation factors also play a significant role. This post and video focus on the model-building aspect.
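The "anonymize or drop" advice above can be sketched in a few lines of pandas. This is a minimal illustration, not a complete anonymization scheme; the column names (`cust_id`, `cust_name`, `balance`, `responded`) are hypothetical stand-ins for whatever PII your campaign dataset carries.

```python
# Sketch: strip PII from a sample before uploading it to ChatGPT.
# Column names are hypothetical placeholders, not from any real dataset.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "cust_name": ["Alice", "Bob", "Carol"],
    "balance": [5000, 1200, 7800],
    "responded": [1, 0, 1],
})

# Drop direct identifiers outright...
df = df.drop(columns=["cust_name"])

# ...and replace the join key with a one-way hash so scored rows can
# still be matched back to the full base later.
df["cust_id"] = df["cust_id"].astype(str).map(
    lambda v: hashlib.sha256(v.encode()).hexdigest()[:12]
)

print(df.columns.tolist())  # identifier columns are gone or hashed
```

A truncated hash keeps the sample readable; keep the full digest (or a salted hash) if collisions or re-identification are a concern.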
...𝗣𝗿𝗼𝗺𝗽𝘁𝟭 ... 𝗰𝗼𝗻𝘁𝗶𝗻𝘂𝗮𝘁𝗶𝗼𝗻 ... For the rest of this conversation, keep all your responses, intermediate responses, and updates brief, curt, and concise. Nothing verbose, but make sure to share the important points: test/train split, treatment of missing values, outliers, and duplicates, the model used, the model metrics mentioned above, etc. Keep all details handy for creating detailed documentation later. Keep all code handy as well, as I will need it to score the full base separately. {If the model results are not good, tweak the hyperparameters and ask ChatGPT to run it again.}
𝗕𝗮𝘀𝗲 𝗣𝗿𝗼𝗺𝗽𝘁𝘀
Run them in sequence, otherwise ChatGPT is likely to error out. Modify the prompts for your specific use case; this might not be the best option for all propensity models.

𝗣𝗿𝗼𝗺𝗽𝘁#𝟭. Analyze the provided campaign dataset: preprocess it, then build and validate a propensity model on training and testing sets. Make a best-judgement call on missing values, outliers, and junk data in records. Check for duplicates. Use random forest. Check specifically for overfitting; if the model overfits, reduce model complexity so that test and training metrics align very closely, running as many iterations as needed. Start with less-complex model hyperparameters as below:
n_estimators: start with 30 trees
max_depth: start with 3
max_features: start with "log2"
min_samples_split: start with 50
min_samples_leaf: start with 50
Report model metrics (ROC-AUC, Gini coefficient) for both test and training. Keep the test and training datasets ready for further analysis.
....𝗣𝗿𝗼𝗺𝗽𝘁𝟭 .. 𝗰𝗼𝗻𝘁𝗶𝗻𝘂𝗲𝗱 𝗶𝗻 𝗻𝗲𝘅𝘁 𝗰𝗼𝗺𝗺𝗲𝗻𝘁... 𝘀𝗲𝗲 𝗮𝗯𝗼𝘃𝗲...
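For reference, here is roughly what Prompt#1 asks ChatGPT to execute, sketched with scikit-learn. The synthetic data is a stand-in for the campaign dataset (an assumption for the sake of a runnable example); the hyperparameters are the exact starting values from the prompt.

```python
# Sketch of the Prompt#1 build, assuming scikit-learn.
# make_classification stands in for the real campaign dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Low-complexity starting point from the prompt: 30 shallow trees with
# large leaf/split minimums. Increase complexity only if the model underfits.
rf = RandomForestClassifier(
    n_estimators=30,
    max_depth=3,
    max_features="log2",
    min_samples_split=50,
    min_samples_leaf=50,
    random_state=42,
)
rf.fit(X_tr, y_tr)

# Report ROC-AUC and Gini for both sets; a large train/test gap = overfitting.
for name, X_, y_ in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    auc = roc_auc_score(y_, rf.predict_proba(X_)[:, 1])
    gini = 2 * auc - 1  # Gini coefficient derived from ROC-AUC
    print(f"{name}: ROC-AUC={auc:.3f}, Gini={gini:.3f}")
```

Printing both sets side by side makes the overfitting check from the prompt trivial: if train AUC is far above test AUC, lower the complexity before trusting the deciles.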
𝗣𝗿𝗼𝗺𝗽𝘁#𝟮. Provide the decile table for test and train, in CSV format, side by side. Keep decile number, count of records, number of responders, and average probability.
𝗣𝗿𝗼𝗺𝗽𝘁#𝟯. Feature importance scores, in CSV format.
𝗣𝗿𝗼𝗺𝗽𝘁#𝟰. Score the dataset and share the original dataset with the score.
𝗣𝗿𝗼𝗺𝗽𝘁#𝟱. Provide the full code that I can use to build and score my main base separately. The main base has a million records. Among other things, make sure to include: the test/train split and model build, scoring code for the main base, a code patch for deciling (output to CSV in the local temp runtime Google Colab directory), and code to output feature importance to CSV. My dataset file path is filepath='/content/drive/MyDrive/xxx/BANK_1M_M.csv' and the data structure is exactly the same. Give me code that I can directly copy, paste, and use.
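If ChatGPT hands the decile table back sorted the wrong way (see the tip about reverse ordering), the fix is easy to verify locally. This is a minimal sketch of the Prompt#2 table in pandas, with random scores standing in for real model probabilities; the convention assumed here is decile 1 = highest propensity.

```python
# Sketch: decile table per Prompt#2, assuming pandas.
# Random scores stand in for model probabilities; decile 1 must be the
# HIGHEST-probability group (ChatGPT often returns the reverse).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
probs = rng.random(1000)
scored = pd.DataFrame({
    "prob": probs,
    # Synthetic responders correlated with the score, for illustration only.
    "responder": (rng.random(1000) < probs).astype(int),
})

# Rank-then-qcut guarantees 10 equal bins even with tied probabilities.
scored["decile"] = pd.qcut(scored["prob"].rank(method="first"), 10, labels=False)
scored["decile"] = 10 - scored["decile"]  # flip so 1 = best, 10 = worst

decile_tbl = (
    scored.groupby("decile")
    .agg(records=("prob", "size"),
         responders=("responder", "sum"),
         avg_prob=("prob", "mean"))
    .reset_index()
)
decile_tbl.to_csv("decile_table.csv", index=False)  # CSV, as the prompt asks
print(decile_tbl)
```

A quick sanity check: average probability (and responder count) should fall as the decile number rises; if it climbs instead, the table came back reversed.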