What Would I Have Done Differently (and Will Do Differently Next Time)?
I gained a lot throughout this journey, and much of what I learned came from things I did wrong. Learn from your mistakes. As such, I’ll walk through some areas that I would reconsider in future development:
- Bootstrapping
- Testing
- Authentication and Authorization
- Pub/Sub and Processing
- Monitoring and Alerting
Bootstrapping
Throughout much of my development experience I emphasized choices that minimized development time, so long as whatever was being sacrificed was worth it. In the case of the front end web development, much of my time was spent creating custom components and messing around with the CSS more than I would’ve liked, especially to make some components responsive (big shoutout to Flexbox).
What I didn’t realize at the time is that there’s a popular library called React Bootstrap, a React implementation of Bootstrap (the CSS framework originally built at Twitter), that comes strapped with an extensive library of prebuilt components. While I enjoyed the fact that I could build the front end as almost an exact replica of the Figma user flow design, I most certainly would’ve sacrificed some functionality and looks for development time and responsiveness. In fact, I probably could’ve saved myself weeks. You live, you learn.
Testing
Testing, while quite the generic concept, could not be more important to proper development. With regards to testing, I’ll speak mostly to the back end and exchange as that’s where most of the logic resides.
For the most part here’s what testing should look like:
In my system I stopped short on testing. While I had a good base of unit tests that executed as part of the CI, all integration testing was done manually and a lot of development time was spent in the AWS console itself. This felt great at first because I was able to get to testing quicker and move on to the next component faster, but in the long term I was essentially shooting myself in the foot. My integration testing wasn’t programmatic, didn’t execute on CI, and didn’t protect against regressions.
Let’s get to the root of the issue though, as an earlier key decision was made that made integration testing exceedingly difficult. Let’s go over some requirements of integration testing which will help to expose where I went wrong:
- Tests run against an actual environment to see if integrations are hooked up properly
- Tests should succeed independent of the state of the environment or be idempotent
- Tests should not corrupt the state of your development environment
Often a logical conclusion of these requirements is that integration testing should run against an ephemeral environment, or an environment meant to last for a limited amount of time. Now, creating an ephemeral environment for the entire back end shouldn’t be necessary when you want to test a single service in a microservice architecture. Thus, the idea should be to build services independently in such a way that you can stand them up independently as well. This was my problem.
While my application code was located in separate repos and the CI ran independently, I thought it was a grand idea to make my infrastructure a monolith, allowing myself to add additional infra without all the boilerplate of a new repo and Terraform Cloud workspace. Plot twist: it wasn’t a grand idea. I wasn’t able to spin up an ephemeral environment without spinning up all of the infrastructure, something I most definitely wanted to avoid for the sake of time and cost, and because of the issues that come with trying to stand up so much infrastructure at once.
This was another point at which I shot myself in the foot, and it led to cascading technical debt that made proper integration testing more difficult. Again, something I would certainly do differently next time.
In summary, if you’re going to build things independently of one another, actually make them independent of one another, including the infrastructure. Secondly, write programmatic integration tests that run against an ephemeral environment during CI to save yourself time in the long run.
Authentication and Authorization
I didn’t know much about authentication and authorization, having worked primarily in big data systems. The extent of the knowledge I did have came from my web systems course. As I understood it, you store users and passwords in a database on creation, making sure to hash and salt the password. Easy enough, right?
The next hurdle was figuring out how to handle authorization, especially for a server-side single page application, or SPA. After some light reading, it seemed like the API would pass back a JSON Web Token, or JWT, that the user would store locally and pass in the header of each API request to verify access to protected API routes. The concept made sense, so I generated a GUID in the API on login and stored it with a TTL in DynamoDB, passing it back to the user and storing it in LocalStorage on the client side. (Strictly speaking, a GUID checked against a table is an opaque session token rather than a signed JWT.) I also created my own decorator for protected API routes that would check the token against the Dynamo table.
import functools
import os
import time

import boto3
from boto3.dynamodb.conditions import Attr, Key
from botocore.exceptions import ClientError

# Request and Response come from the API framework in use; that import is not shown here.


def login_required(f):
    """
    Decorator to require a valid token and uid in the request headers
    :return: the wrapped handler, which responds with 403 if the check fails
    """
    @functools.wraps(f)
    def wrap(request: Request):
        if (not request.headers or 'token' not in request.headers
                or 'uid' not in request.headers):
            return Response(body={"error": "Not authorized"}, status_code=403)

        uid = request.headers['uid']
        token = request.headers['token']
        dynamo = DynamoDb()
        verified_session = dynamo.check_token(uid=uid, token=token)
        if verified_session:
            return f(request)
        return Response(body={"error": "Not authorized"}, status_code=403)
    return wrap


class DynamoDb:
    def __init__(self):
        self.db = boto3.resource('dynamodb', endpoint_url=os.environ.get("ENDPOINT_URL", None))
        self.token_table_name = os.environ["TOKEN_TABLE_NAME"]

    def check_token(self, uid: int, token: str) -> bool:
        """
        Checks dynamo to see if user exists in keys and if token matches
        :param uid: user id
        :param token: generated token on login
        :return: bool of whether the token matches
        """
        table = self.db.Table(self.token_table_name)
        epoch_time_now = int(time.time())
        try:
            # TimeToLive is not a key attribute, so filter on it with Attr.
            # It is stored as a string, so this is a string comparison.
            response = table.query(
                KeyConditionExpression=Key('Uid').eq(uid),
                FilterExpression=Attr('TimeToLive').gt(str(epoch_time_now))
            )
            items = response.get('Items', [])
            return len(items) > 0 and items[0]['Token'] == token
        except ClientError:
            return False

    def insert_token(self, uid: int, token: str):
        """
        Insert new token into dynamo table
        :param uid: user id
        :param token: token for session
        """
        table = self.db.Table(self.token_table_name)
        # Session expires 30 minutes from now (epoch seconds).
        expiry_datetime = int(time.time()) + 30 * 60

        try:
            table.put_item(
                Item={
                    'Uid': uid,
                    'Token': token,
                    # Note: DynamoDB's built-in TTL deletion only acts on Number
                    # attributes, so a string value is never expired automatically.
                    'TimeToLive': str(expiry_datetime)
                }
            )
        except Exception as e:
            print(e)
While this worked, what I was missing was a lot of best practices and tooling around authentication and authorization that would’ve handled much of this for me in addition to protected pages on the front end. In essence, I spent more development time and got less functionality than I otherwise would’ve.
Introducing AWS Amplify and Amazon Cognito!
Amazon Cognito lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily. Amazon Cognito scales to millions of users and supports sign-in with social identity providers, such as Apple, Facebook, Google, and Amazon.
AWS Amplify is a set of tools and services that can be used together or on their own, to help front-end web and mobile developers build scalable full stack applications, powered by AWS. With Amplify, you can configure app backends and connect your app in minutes, deploy static web apps in a few clicks, and easily manage app content outside the AWS console.
Using Amazon Cognito, I could configure users and user pools to integrate directly with IAM. API Gateway also has the ability to control access to the REST API using Cognito user pools. Additionally, AWS Amplify provides pre-built UI components for React for signing up and signing in. Even better, it also provides an easy way to wrap your protected routes and verify against Cognito.
The lesson here was to use tooling to my advantage especially when it comes to security. Additionally, while school taught me the basics, it’s fair to assume there’s always a better way, especially with the speed at which technology evolves today.
Pub/Sub and Processing
I’ve mentioned throughout this article that I selected Kinesis to connect services throughout the architecture, using NerdWallet’s Python Kinesis Consumer library to process records. Additionally, I took advantage of DynamoDB for integrity checks to achieve exactly-once processing semantics.
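The original integrity-check code is not shown in this article, so the following is only a sketch of one common way such a check can work: claim each record ID with a conditional write before handling it, and skip records that were already claimed. The `ProcessedRecordStore` class is a hypothetical in-memory stand-in for the table, and the record shape with an `id` field is an assumption.

```python
from typing import Callable, Dict, Iterable, Set


class ProcessedRecordStore:
    """In-memory stand-in for a DynamoDB table keyed by record ID."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def claim(self, record_id: str) -> bool:
        """Return True only the first time a record ID is claimed.

        Against DynamoDB this could be a put_item call with a
        ConditionExpression such as attribute_not_exists(...); a failed
        condition would mean an earlier delivery already claimed the record.
        """
        if record_id in self._seen:
            return False
        self._seen.add(record_id)
        return True


def process_batch(
    records: Iterable[Dict[str, str]],
    store: ProcessedRecordStore,
    handler: Callable[[Dict[str, str]], None],
) -> int:
    """Run the handler once per unique record ID; return how many were handled."""
    handled = 0
    for record in records:
        if not store.claim(record["id"]):
            continue  # duplicate delivery, already handled
        handler(record)
        handled += 1
    return handled
```

Note the gap this leaves: claiming before handling means a crash between the two steps drops the record, while claiming after handling risks a duplicate. Closing that gap is part of the extra complexity discussed next.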
In retrospect, there were more downsides than originally anticipated that stemmed from Kinesis with Python for processing:
- Processing speeds
- Additional complexity for achieving transactional semantics
- Bugs not caught till runtime resulting in additional unit testing or manual integration testing
- Ambiguity in scaling the codebase for additional complexity
Many of these concerns could’ve been addressed by consuming from Kafka using Java with the Spring Framework (Spring Boot, Spring Data, Spring Cloud, Spring Caching, etc.) and the Kafka Transactions API. This would’ve kept development patterns more structured, caught bugs at compile time, and increased processing speeds. Additionally, I would’ve strongly considered venturing out of AWS into Confluent Cloud for Kafka management, as it provides two features that would’ve delivered great value:
- Kafka Schema Registry to ensure inter-service communication adhered to the contract between services
- Kafka GUI for managing and monitoring Kafka
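To make the idea of a contract between services concrete, here is a small plain-Python sketch of the kind of check a schema registry automates. It does not use the Schema Registry API; the `OrderPlaced` event and its fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, get_type_hints


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event contract shared by a producer and a consumer."""
    order_id: str
    symbol: str
    quantity: int


def parse_event(raw: Dict[str, Any]) -> OrderPlaced:
    """Reject messages that do not match the contract before processing them."""
    expected = get_type_hints(OrderPlaced)
    missing = expected.keys() - raw.keys()
    unexpected = raw.keys() - expected.keys()
    if missing or unexpected:
        raise ValueError(
            f"missing fields: {sorted(missing)}, unexpected fields: {sorted(unexpected)}"
        )
    for name, expected_type in expected.items():
        if not isinstance(raw[name], expected_type):
            raise TypeError(f"{name} must be of type {expected_type.__name__}")
    return OrderPlaced(**raw)
```

With a registry, this validation happens at serialization time against a centrally versioned schema, so a producer cannot silently publish a shape its consumers do not expect.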
All in all, this would’ve been an uphill battle as there was much I was unfamiliar with, but one that would’ve yielded a lot of learning, especially for future projects.
Monitoring and Alerting
My logs are being flushed to CloudWatch, I should be fine, right? I mean the product works, we can launch it and figure out monitoring later, right? Wrong. Let’s discuss some impacts of monitoring as an afterthought:
- Issues aren’t detected, causing SLA breaches, outages, etc.
- Issues can’t be responded to quickly
- Redesigns and refactoring may be needed to account for monitoring requirements
- Operations are interrupted
Essentially, lack of metrics, monitoring, and alerting introduces a lot of heavy work as the only way to ensure the system is working properly is to constantly check the state of the system. The end result could be catastrophic, crippling your system and users’ trust in it.
In retrospect, I would’ve addressed monitoring functions and requirements during the design phase which would’ve guided my logging and metrics efforts during implementation. The end result — confidence that the system is working as intended.
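Since the logs were already going to CloudWatch, one low-effort starting point might have been the CloudWatch Embedded Metric Format, where a structured JSON log line doubles as a metric. The sketch below builds such a line; the namespace, the `Service` dimension, and the metric name in the usage note are all hypothetical, and an alarm on the resulting metric would still need to be configured separately.

```python
import json
import time
from typing import Optional


def emf_metric_line(
    namespace: str,
    service: str,
    metric_name: str,
    value: float,
    unit: str = "Count",
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one Embedded Metric Format log line carrying a single metric."""
    record = {
        "_aws": {
            # Epoch milliseconds for the data point.
            "Timestamp": int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],  # refers to the "Service" key below
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        "Service": service,
        metric_name: value,  # the metric value lives at the top level
    }
    return json.dumps(record)
```

Printing `emf_metric_line("Exchange", "order-consumer", "RecordsFailed", 1)` from a service whose output reaches CloudWatch Logs through an EMF-aware path would give a failure count to alarm on, rather than a log to go hunting through.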