There are many things to consider when handling and normalizing data. Since I love use cases as well as edge cases I figured it might be interesting to list out a few thoughts around ingesting data as well as handling data.
No matter if you are a company of one or operate a billion dollar business, data is something everyone wants or has a need for.
Data can be defined in many ways, but the broad handling of data can be summed up in 3 ways
- Manual - Ability for a person to send you data e.g. a web form
- Semi-Automatic - Ability for a person to upload a file - e.g a CSV or JSON file
- Automatic - Ability for real-time updates e.g. API or Webhook
Manual - e.g. Web Form
Easy to create and allows anyone to fill out and complete.
Things to consider:
- Validation - Handling data that may require localization and internationalization. That simple 10 digit phone number collector now needs to handle things like extensions and country codes
- Sanitization - Some people LOVE to WRITE in UPPERCASE or lowercase or SCREAMING_SNAKE_CASE 🐍 🤦♂️
- Similar Data or the need for “cloning” - When your web form asks for the same data and someone is having to enter it in by hand with slight variations (e.g. 100+ addresses where the city and state are the same, but the street address and name vary)
Semi-Automatic - e.g. CSV File
Always provide a CSV template, you can thank me later.
Things to consider:
- Column Names - First Name vs fname vs first_name 🤷♂️
- Merging Changes - How do you handle if the data already exists, is it checked or do you just overwrite if the data is already there
- Preview Changes - allowing a user to verify and check their data before inserting/updating it as well as a simple sanity check that what they are changing is what they want
- Roll Back - When you upload a CSV only to realize you had a typeo that was copy/paste into the same field for 100+ rows… how can you easily “undo” this 🤔
- Ordering - Was first name the first column or the second column in your file 👀 Can you easily match up the columns based on what you think and let the user confirm?
Automatic - e.g. API
Just let the bot do it 🤖
Things to consider:
- Authentication - Are you using API keys, if so can you roll the API keys and do you have separate test and production keys? If not, maybe you are using oAuth and now need to store the refresh tokens and keep them “fresh” or now everyone just got logged out
- Throttling - Do you allow anyone to go full hog 🐷 on your service or do you limit request or provide feedback on how soon they can kickoff another request or how many API request they have left
The role of data from both ingest to display can mean many things to many people. You can also combine things like semi-automatic (e.g. CVS Upload) that then generates an API that can be used both internally and externally for rapid prototyping or data migration.
Lastly I want to touch on the part of data validation.
How do you know your data is “right” or do you assume the data is bad, but better than nothing? 🤔 Say a customer uploads a database of addresses (e.g. dentist offices) and you run them through some type of usps checker or address verification tool (e.g. SmartyStreets - not picking on them, I love their product, just saying!)
You could assume that you always want to go with smartysteets data when they are different, but in some cases you have office addresses that are old and may have moved so a bad address in, even corrected by smartystreets could mean for wrong/bad data in both datasets. The other case could be that smartystreets dataset is off (for now, could be updated later) and since the office is a brand new building with a street that did not exists a month ago, the address you have is correct and should ignore the verification process.
Data is hard and keeping it up to date can be a challenge into itself.