Microservices is a distributed systems architecture style characterized by:
- Use of small, autonomous services
- Services collaborate to complete use cases
- Microservices are primarily modeled around business domains
- Microservices have independent life-cycles
- The fundamental design of a system should make it easy to change
Strengths of a distributed system:
- Gives development teams increased autonomy
- Different parts of the system can be scaled independently
- Easy to build a resilient system, such that one server/service going down does not bring down the entire system
- Composability of services
Of microservices, specifically:
- Each service can use the technology most appropriate to it (Technology Heterogeneity)
- Lower risk when trying out new technologies - You can apply it to one service at a time
- Lower risk deployments - one service instead of the entire system (supports good Continuous Deployment practices)
- Strong code boundaries can align with team boundaries
(Based on Visual Studio 2019 functionality)
In a monolithic application, existing IDE tools can identify every place in the project that a Type/Method/etc is used.
With microservices, we are dealing with many small applications which are still coupled together through API calls or a message bus. Existing IDE tools cannot locate everywhere the code is coupled because they treat each application/project as a standalone piece of code.
I am reduced to using grep/findstr to verify where an API call or message is being used.
This problem may indicate that the microservices have not been divided correctly. Reorganizing the division of tasks may mean that the microservices do not need to call each other, thereby removing the runtime coupling.
I would only consider this a problem between domain-services. It is ok for domain-services to be coupled to utility-services (such as a logging service), because utility-services should be very stable.
Is message bus an anti-pattern?
A message bus or event queue here refers to a system that you publish messages to, and that subscribers can read messages from. And it is used with microservices to provide asynchronous reactions within your system.
This is good for messages that can arrive out of order and can sometimes be dropped.
This is bad for mission-critical messages. And that's how I'm seeing it used at multiple companies.
- What if the messages are processed out of order?
- Because one errored out and came back into the queue for a retry behind a later message
- Because multiple processes are picking up messages in parallel
- What if a message errors out the maximum number of times?
- Why is it so hard to just find out what messages have been sent and responded to?
An earlier iteration of this idea was to add records to a database, queried by subscribers, and updated with responses. What was wrong with that approach? You have long-term records (and data storage is cheap) of exactly what happened in an easy-to-access location.
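One common mitigation for the retry and out-of-order problems above is to make subscribers idempotent, so that a duplicated or redelivered message is harmless. A minimal sketch (all names hypothetical; a real system would use durable storage, not an in-memory set):

```python
processed_ids = set()   # in production: a durable store, not memory
results = []            # stands in for the real side effects

def handle_message(message):
    # Skip messages we have already processed (duplicate redelivery).
    if message["id"] in processed_ids:
        return
    results.append(message["payload"])   # the "real" work
    processed_ids.add(message["id"])

# A retry redelivers message 1 after message 2 was already handled.
handle_message({"id": 1, "payload": "created"})
handle_message({"id": 2, "payload": "updated"})
handle_message({"id": 1, "payload": "created"})  # duplicate is ignored
```

This doesn't fix ordering, but it makes at-least-once delivery safe, which is usually the first step.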
What is the ideal size for a microservice?
Rules of thumb:
- (Jon Eaves) it could be rewritten in two weeks
- (Sam Newman) A small development team can manage the whole thing
- (Sam Newman) You don't think it's too big
- (Sam Newman) A few hundred lines long
Each microservice owns and maintains a private database. Other services can only access this data by querying the owner service.
This keeps the services loosely coupled.
Each service can use the data storage method that best suits them.
This does not have to be implemented as an entire database per service. It could be a set of tables per service, or a schema per service, etc.
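A minimal sketch of the pattern with hypothetical service and field names: the Orders service only ever reaches customer data through the owning service's API, never its storage.

```python
class CustomerService:
    """Owns the customer data; the only code that touches customer storage."""
    def __init__(self):
        self._customers = {1: {"id": 1, "name": "Ada"}}   # stands in for a DB

    def get_customer(self, customer_id):
        # The public API: the only way other services see this data.
        return dict(self._customers[customer_id])

class OrderService:
    """Owns order data; depends on CustomerService's API, not its database."""
    def __init__(self, customer_api):
        self._orders = {}
        self._customer_api = customer_api

    def place_order(self, order_id, customer_id, item):
        customer = self._customer_api.get_customer(customer_id)  # API call
        self._orders[order_id] = {"customer": customer["name"], "item": item}
        return self._orders[order_id]

customers = CustomerService()
orders = OrderService(customers)
order = orders.place_order(101, 1, "book")
```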
Given the Database Per Service pattern, how do you ensure transactional integrity when an update affects multiple services?
A saga is a sequence of local transactions. As each transaction completes (updates the service's database) it publishes a message that triggers the next service in the process to run their transaction. If any step fails, the saga executes a reverse series of steps that undo (or compensate for) the changes that were already saved.
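The saga described above could be sketched like this (hypothetical steps; a real saga would coordinate via published messages rather than in-process calls): each local transaction is paired with a compensating action, and a failure triggers the compensations in reverse order.

```python
def run_saga(steps, state):
    """steps: list of (do, undo) callables operating on shared state."""
    completed = []
    for do, undo in steps:
        try:
            do(state)
            completed.append(undo)
        except Exception:
            # Roll back: run compensating actions in reverse order.
            for compensate in reversed(completed):
                compensate(state)
            return False
    return True

# Hypothetical order flow: the reservation succeeds, the payment fails,
# so the reservation is compensated.
log = []
def reserve(s): log.append("reserve")
def unreserve(s): log.append("unreserve")
def charge(s): raise RuntimeError("payment declined")
def refund(s): log.append("refund")

ok = run_saga([(reserve, unreserve), (charge, refund)], {})
```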
Given the Database Per Service pattern, how do you handle queries that require joining data across services?
Implement an API Composer, which is a service that you can query, which manages collecting the data from several other services and joining it all together.
This may require in-memory operations on large data sets, which will be inefficient.
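A rough sketch of an API composer with hypothetical services, showing the in-memory join the note above warns about:

```python
# Stand-ins for calls to two independent services' APIs.
def get_customers():
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

def get_orders():
    return [{"customer_id": 1, "item": "book"},
            {"customer_id": 1, "item": "pen"},
            {"customer_id": 2, "item": "lamp"}]

def customer_order_report():
    """The composer: fetch from both services, join in memory."""
    customers = {c["id"]: c["name"] for c in get_customers()}
    return [{"customer": customers[o["customer_id"]], "item": o["item"]}
            for o in get_orders()]

report = customer_order_report()
```

Fine for small result sets; for large ones this join cost is exactly why the CQRS/read-replica answer below exists.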
CQRS stands for Command Query Responsibility Segregation.
Given the Database Per Service pattern, how do you handle queries that require joining data across services?
Implement a read-only database that replicates the production data. Bring in data from all your various sources. Now queries can be run against this database.
An API Gateway is a public API that stands between clients and microservices. The clients only call the API Gateway, and the API Gateway calls whichever microservice the request should be delegated to.
See Facade and Adapter Patterns.
An extension of API Gateway.
Create one API Gateway for each frontend client.
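A toy sketch of the gateway's routing responsibility, with hypothetical internal hostnames: clients only see the public paths, and the mapping to microservices can change behind them.

```python
ROUTES = {
    "/customers": "http://customer-service.internal",
    "/orders":    "http://order-service.internal",
}

def route(path):
    """Return the internal backend URL that should serve a public path."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend + path
    raise LookupError("no route for " + path)
```

A Backend-for-Frontend is then just one such routing (and aggregation) layer per client type.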
Notes from reading "Building Microservices" by Sam Newman.
Shout out for Eric Evans's "Domain Driven Design": the importance of representing the real world in our code.
Shout out for Alistair Cockburn's Hexagonal Architecture design pattern: guiding us away from layered architectures where business logic can hide.
"Domain driven design. Continuous delivery. On-demand virtualization. Infrastructure automation. Small autonomous teams. Systems at scale. Microservices have emerged from this world."
Microservices are "small, and focused on doing one thing well."
Monolithic codebases generally lose their ideal structures. "Code related to similar functions starts to become spread all over, making fixing bugs or implementations more difficult."
(I've seen this happen with microservices as well. You can't replace developer discipline with architecture.)
"Cohesion - the drive to have related code grouped together...This is reinforced by Robert C. Martin's definition of the Single Responsibility Principle, which states gather together those things that change for the same reason, and separate those things that change for different reasons."
Follow the Single Responsibility Principle by aligning service boundaries with business boundaries. "Making it obvious where code lives for a given piece of functionality."
"How small is small? Jon Eaves characterizes a microservice as something that could be rewritten in two weeks, a rule of thumb that makes sense for his particular context."
"As you get smaller, the benefits around interdependence increase. But so too does some of the complexity that emerges from having more and more moving parts...As you get better at handling this complexity, you can strive for smaller and smaller services."
Rule of thumb is to deploy one service to one machine (or virtual machine). "Although this solution can add some overhead, the resulting simplicity makes our distributed system much easier to reason about."
"All communication between the services themselves are via network calls, to enforce separation between the services and avoid the perils of tight coupling."
- can be changed independently of each other
- can be deployed without requiring consumers to change
"If there is too much sharing, our consuming services become coupled to our internal representations. This decreases our autonomy."
"Our service exposes an API (application programming interface), and collaborating services communicate with us via those APIs."
"The golden rule: can you make a change to a service and deploy it by itself without changing anything else?"
"With a system composed of multiple, collaborating services, we can decide to use different technologies inside each one."
Regarding using multiple technology stacks: "Just like many things concerning microservices, it's all about finding the right balance."
Example of the scope of microservices: "Gilt, an online fashion retailer, ...today has over 450 microservices, each one running on multiple separate machines."
(A lot of this chapter went straight into the "Strengths" of microservices section)
Indication of microservice size: "How often have you deleted more than a hundred lines of code in a single day and not worried too much about it? With microservices often being of similar size, the barriers to rewriting or removing services entirely are very low...When a codebase is just a few hundred lines long, it is difficult for people to become emotionally attached to it, and the cost of replacing it is pretty small."
"There is a lack of good consensus on how to do SOA well. In my opinion, much of the industry has failed to look holistically enough at the problem and present a compelling alternative to the narrative set out by various vendors in this space...Many of the problems laid at the door of SOA are actually problems with things like communication protocols (e.g. SOAP), vendor middleware, a lack of guidance about service granularity, or the wrong guidance on picking places to split your system...Much of the conventional wisdom around SOA doesn't help you understand how to split something big into something small. It doesn't talk about how big is too big."
"The microservice approach has emerged from real-world use, taking our better understanding of systems and architecture to do SOA well."
"Technically, it should be possible to create well-factored, independent modules within a single monolithic process. And yet we rarely see this happen. The modules themselves soon become tightly coupled with the rest of the code, surrendering one of their key benefits. Having a process boundary separation does enforce clean hygiene in this respect."
"[Microservices are no silver bullet]...They have all the associated complexity of distributed systems...If you're coming from a monolithic system point of view, you'll have to get much better at handling deployment, testing, and monitoring to unlock the benefits we've covered so far."
The Evolutionary Architect
"Our industry is a young one. This is something we seem to forget, and yet we have only been creating programs that run on what we recognize as computers for around 70 years. Therefore, we are constantly looking to other professions in an attempt to explain what we do...We aren't medical doctors or engineers, but nor are we plumbers or electricians. Instead, we fall into some middle ground, which makes it hard for society to understand us, or for us to understand where we fit."
"Perhaps the term 'architect' has done the most harm. The idea of someone who draws up detailed plans for others to interpret, and expects this to be carried out. The balance of part artist, part engineer, overseeing the creation of what is normally a singular vision, with all other viewpoints being subservient, except for the occasional objection from the structural engineer regarding the laws of physics."
"In our industry, this view of the architect leads to some terrible practices...page after page of documentation, created with a view to inform the construction of the perfect system, without taking into account the fundamentally unknowable future. Utterly devoid of any understanding as to how hard it will be to implement, or whether or not it will actually work, let alone having any ability to change as we learn more."
"Our requirements shift more rapidly than they do for people who design and build buildings - as do the tools and techniques at our disposal."
Since software requirements are always changing, architects need to focus on designing a system that can change.
Comparison of software architecture to city planning, where you can designate zoning rules but you don't specify specifically where or how each building is built.
"Rather than worrying too much about what happens in one zone, the town planner will instead spend far more time working out how people and utilities move from one zone to another."
"We cannot foresee everything that will happen, and so rather than plan for any eventuality, we should plan to allow for change by avoiding the urge to over-specify every last thing."
zones => groups of services (see Domain Driven Design: large-scale structures)
"As architects, we need to worry much less about what happens inside the zone than what happens between the zones."
(aside) "I cannot emphasize how important it is for the architect to actually sit with the team! This is significantly more effective than having a call or just looking at her code...It should be a routine activity."
"Making decisions in system design is all about trade-offs, and microservice architecture gives us lots of trade-offs to make! When picking a datastore, do we pick a platform that we have less experience with, but that gives us better scaling? Is it ok for us to have two different technology stacks in our system? What about three?"
"Framing here can help, and a great way to help frame our decision making is to define a set of principles and practices that guide it, based on goals that we are trying to achieve."
Strategic Goals: high level goals of the business.
Principles: "Principles are rules you have made in order to align what you are doing to some larger goal, and will sometimes change."
"For example, if one of your strategic goals as an organization is to decrease the time to market for new features, you may define a principle that says that delivery teams have full control over the lifecycle of their software to ship whenever they are ready, independently of any other team."
Recommendation: have fewer than 10 principles - it can fit on a poster, people can remember them, they don't contradict each other
[Example: Heroku's 12 principles for using their platform]
Constraint: something that is very hard, or impossible, to change.
"Personally, I think there can be some value in keeping [Constraints and Principles] in the same list to encourage challenging constraints every now and then and see if they really are immovable!"
Practices: "Our practices are how we ensure our principles are being carried out. They are a set of detailed, practical guidelines for performing tasks...Low-level enough that any developer can understand them."
Ex: coding guidelines, use HTTP/REST for integration, use central logging
"Practices should underpin our principles."
Some standard things services should support:
Monitoring: a system-wide view of the health of our services.
"To make this as easy as possible, I would suggest ensuring that all services emit health and general monitoring-related metrics in the same way."
Interfaces: use one (or very few) interface technologies. This makes it easier to integrate new consumers.
"If you pick HTTP/REST, will you use verbs or nouns? How will you handle pagination of resources? How will you handle versioning of end points?"
Architectural Safety: don't allow one bad service to take down the whole system.
Exemplars: create sample, runnable code that exemplifies the practices you want the team to use.
"Ideally, these would be real-world services you have that get things right, rather than isolated services that are just implemented to be perfect examples...By ensuring that your exemplars are actually being used, you ensure that all principles you have actually make sense."
Tailored Service Template: create service templates that have all the standard stuff already setup.
"You do have to be careful that creating the service template doesn't become the job of a central tools or architecture team who dictates how things should be done, albeit via code. Defining the practices you use should be a collective activity, so ideally your team(s) should take joint responsibility for updating this template."
Ease of use should be the guiding force. Don't build up a framework monstrosity.
Be careful of shared code, which is coupling. Some companies copy-paste the service template instead of using a shared library.
Technical Debt: we can't always get everything done the way we want it immediately, so there is a balance in what we can complete now and what we have to put off until later. How you manage this debt varies from company to company.
Exception Handling: (poorly chosen name - he means making exceptions to principles and practices) Keep track of these exceptions - if enough of the same type stack up, it indicates a change should be made to your practices.
From COBIT (Control Objectives for Information and Related Technology)
"Governance ensures that enterprise objectives are achieved by evaluating stakeholder needs, conditions, and options; setting direction through prioritization and decision making; and monitoring performance, compliance, and progress against agreed-on direction and objectives."
"If one of the architect's jobs is ensuring there is a technical vision, then governance is about ensuring what we are building matches this vision, and evolving the vision if needed."
Architects are responsible for:
- Vision: ensure there are a set of principles derived from the business strategy
- Empathy: ensure they are not making developers miserable; understand the impact of your decisions on your colleagues
- Collaboration: help others grow, get their buy-in, spread the load of leadership
- Adaptability: keep up-to-date with new technologies; change your vision as the situation changes
- Autonomy: find the balance between standardization and letting teams decide
- Governance: ensure the system matches the technical vision; make trade-off decisions
"Normally, governance is a group activity...This group needs to be led by technologists, and to consist predominantly of people who are executing the work being governed. This group should also be responsible for tracking and managing technical risks."
"The architect is responsible for making sure the [governance] group works, but the group as a whole is responsible for governance...This shares the load, and ensures that there is a higher level of buy-in...and that information flows freely from the teams into the group."
What if the group disagrees with the architect?
Author recommends siding with the group. "The group is often much wiser than the individual, and I've been proven wrong more than once! And imagine how disempowering it can be for a group to have been given space to come up with a decision, and then ultimately be ignored."
"But sometimes I have overruled the group. But why, and when?" Comparison to bike riding - it's ok to let the kid fall down, but not to let them veer into traffic.
"Much of the role of the technical leader is about helping grow them [the developers] - helping them understand the vision themselves - and ensuring that they can be active participants in shaping and implementing the vision."
"With larger, monolithic systems, there are fewer opportunities for people to step up and own something. With microservices, on the other hand, we have multiple autonomous codebases that will have their own independent lifecycles. Helping people step up by having them take ownership of individual services before accepting more responsibility can be a great way to help them achieve their own career goals..."
"I am a strong believer that great software comes from great people. If you worry only about the technology side of the equation, you're missing way more than half of the picture."
"The worst reaction to all these forces that push us toward change is to become more rigid or fixed in our thinking."
How to Model Services
What makes a good service? Loose coupling between services and high cohesion within services.
Loose Coupling: "A change to one service should not require a change to another...A loosely coupled service knows as little as it needs to about the services with which it collaborates...Chatty communication can lead to tight coupling."
High Cohesion: "We want related behavior to sit together, and unrelated behavior to sit elsewhere...If we want to change behavior, we want to be able to change it in one place...Making changes in lots of different places is slower, and deploying lots of services at once is risky."
Bounded Context: see Domain Driven Design
A bounded context is like the encapsulation of an object. There are private data and operations inside the boundary. There is a public interface that other services can interact with.
"A specific responsibility enforced by explicit boundaries."
"These bounded contexts lend themselves extremely well to being compositional boundaries...In general, microservices should cleanly align to bounded contexts."
Shared and Hidden Models:
The internal model of a bounded context may well not be the external model it shares with other services.
And models with the same name in different bounded contexts can mean very different things or subtly different things.
"By thinking clearly about what models should be shared, and not sharing our internal representations, we avoid one of the potential pitfalls that can result in tight coupling."
Modules and Services:
"When starting out, keep a new system on the more monolithic side; getting service boundaries wrong can be costly..."
I.e. start with Modules within a monolithic system, and gradually break them out into Services.
Premature Decomposition: anecdote about a team needing to merge all their microservices back into a monolithic system due to unforeseen complications with their CI tool. Then they figured out better boundaries to break it apart on.
- I'm not sure what the CI tool had to do with it.
Business Capabilities: "Ask first 'what does this context do?' and then 'What data does it need to do that?'" Don't determine bounded contexts based on shared data - this is usually incorrect.
"When modeled as services, these [business] capabilities become the key operations that will be exposed over the wire to other collaborators."
From the single responsibility principle, also consider what forces/departments will cause a module/service to have to change.
You may well end up with nested bounded contexts. Maybe the first level determines service boundaries, and the second level determines modules within a service. Or the second level is also a set of services, but they are entirely hidden behind the first service.
The decision of how and when to break out microservices will often depend on your organization. If a different team is in charge of it, it probably needs to be a separate service.
The decision of how and when to break out microservices can also depend on automated test capabilities.
Communication in Terms of Business Concepts: "If our systems are decomposed along the bounded contexts that represent our domain, the [business] changes we want to make are more likely to be isolated to one, single microservice boundary. This reduces the number of places we need to make the change, and allows us to deploy that change quickly."
"The same terms and ideas that are shared between parts of your organization should be reflected in your interfaces."
"It can be useful to look at what can go wrong when services are modeled incorrectly."
Anecdote about a company that split their service by layers (front end, back end, data access) instead of vertical slices. And also did not abstract business use cases from the backend, they just spoke in low-level database terms everywhere.
"Making decisions to model services boundaries along technical seams isn't always wrong (see performance issues)...However, it should be your secondary driver for finding these seams, not your primary one."
Ch 4: Integration
Avoid breaking changes - changes that force your consumers to change.
Keep your APIs technology-agnostic - technology changes rapidly. Avoid integration technologies that dictate what tech stacks we can use to implement our microservices.
Make your service simple for consumers - freedom on tech choice is good, but providing a client library can ease adoption (but they increase coupling).
Hide internal implementation details - we don't want our consumers to be bound to our internal implementation (that turns more changes into breaking changes).
Interfacing with customers
Shared Database - consumers have direct access to the database
In short, do not do this. It is the tightest coupling possible.
Everyone is tied to the exact same model, and to one type of database.
There is no encapsulation of domain object behavior - each consumer spreads the behavior around.
Synchronous VS Asynchronous
"Should communications be synchronous or asynchronous? This fundamental choice inevitably guides us toward certain implementation details."
Synchronous: a call is made to a remote server, which blocks until the operation completes.
Easier to reason about.
Enables Request/Response communication.
Asynchronous: the caller doesn't wait for the operation to complete before returning. The caller may not care if the operation ever happens at all.
Useful for long running jobs.
Provides low latency. Can keep a UI responsive even when the network is laggy.
But the technology is more complicated.
Additional work is required for monitoring/tracking that everything is happening correctly.
Enables Event-Based communication.
Request/Response: a client initiates a request and waits for a response.
This is usually done synchronously, but can be done asynchronously with callbacks.
Event-Based: a client reports that something has happened, and listeners take action based on that event.
Business logic spreads out more evenly, rather than being centrally located. This can result in higher cohesion.
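A minimal in-process sketch of event-based collaboration (hypothetical event name): the publisher does not know or care who is listening, which is the source of the loose coupling. A real system would route this through a broker, but the shape is the same.

```python
subscribers = {}

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event_type, payload):
    # The publisher has no knowledge of who reacts, or how.
    for handler in subscribers.get(event_type, []):
        handler(payload)

# Two independent "services" react to the same event.
emails, invoices = [], []
subscribe("order_placed", lambda e: emails.append(e["customer"]))
subscribe("order_placed", lambda e: invoices.append(e["total"]))
publish("order_placed", {"customer": "ada@example.com", "total": 20})
```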
Orchestration VS Choreography
Orchestration: a central system tells others when to run. See Request/Response.
"Simpler, and we get to know if things worked straight away."
"Tends to be brittle, with a higher cost of change."
Choreography: each system decides when to take action. See Event-Based.
"In general, I have found that systems that tend more towards the choreographed approach are more loosely coupled, and are more flexible and amenable to change."
"Each service is smart enough to understand its role in the whole dance."
Remote Procedure Call (RPC)
"Refers to the technology of making a local call and having it execute on a remote service somewhere."
May rely on having an interface definition (see SOAP, Thrift, protocol buffers)
"The use of separate interface definitions can make it easier to generate client and server stubs for different tech stacks." (See SOAP with WSDL)
"Other tech, like Java RMI, calls for a tighter coupling between the client and server, requiring that both use the same underlying technology."
The core characteristic is that a remote call looks like a local call.
Ease of use is a selling point.
Can become coupled to particular technology.
May hide that a call is remote too much - you need to know what calls will be significantly slower.
Some RPC solutions are very brittle. Changes to a Java RMI solution are always breaking changes.
REpresentational State Transfer (REST)
Usually used over HTTP
"How a resource is shown externally is completely decoupled from how it is stored internally." - This can be done as badly with REST as with any other architecture. And there is nothing inherent to RPC that makes the public contracts match internal models.
Uses HTTP verbs in standard ways - easier to learn a new interface.
HTTP is well supported with many tools.
Downside - HTTP is not the most efficient of formats
Hypermedia As The Engine Of Application State (HATEOAS)
An extension of REST
Resources return links to related resources and actions (state transitions). The links have unchanging semantic labels, which frees the links themselves to change without breaking the consumer.
"By following the links, the client gets to progressively discover the API, which can be a really handy capability when we are implementing new clients."
"Using these controls to decouple the client and server yields significant benefits over time that greatly offset the small increase in the time it takes to get these protocols up and running."
Downside - can be very chatty, as the client crawls through the network of links.
Downside - not well supported by tools yet - you'll have to roll your own servers and clients
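A sketch of the navigation idea, loosely based on the book's album example (URLs hypothetical): the client binds to the semantic `rel` labels, so the server is free to change the URLs themselves.

```python
album = {
    "name": "Give Blood",
    "links": [
        {"rel": "artist",   "href": "/artists/theBrakes"},
        {"rel": "purchase", "href": "/shop/give-blood"},
    ],
}

def follow(resource, rel):
    """Find the link with the given semantic label."""
    for link in resource["links"]:
        if link["rel"] == rel:
            return link["href"]
    raise LookupError(rel)

artist_url = follow(album, "artist")
```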
JSON, XML, or other
JSON is very popular right now - succinct, but lacks semantic tags of XML
Hypertext Application Language (HAL) attempts to add hyperlinking to JSON
XML is supported by XPath, which allows you to find tags even when they have moved around the data structure - more resilient clients
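A small runnable illustration of that resilience using Python's ElementTree, which supports a subset of XPath: `.//price` matches the tag at any depth, so the client survives the element being moved.

```python
import xml.etree.ElementTree as ET

# The same data before and after the server restructures the document.
old = ET.fromstring("<album><price>9.99</price></album>")
new = ET.fromstring("<album><pricing><price>9.99</price></pricing></album>")

# ".//price" finds the element wherever it lives, so both layouts work.
old_price = old.find(".//price").text
new_price = new.find(".//price").text
```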
Beware Too Much Convenience
For example, tools that stand up a service quickly by serializing Domain models and sending them out as REST contracts are not good
Asynchronous Event-Based Collaboration
Choose how services will emit events, and how subscribers will listen for them
"Traditionally, message brokers like RabbitMQ try to handle both problems."
Can also keep track of what messages a subscriber has already seen, and handle error queues.
"These systems are usually designed to be scalable and resilient, but that doesn't come for free. It can add complexity to the development process, because it is another system you may need to run to develop and test your services."
But once it is running, it can be a very effective solution.
There are also options like Atom feeds - but the subscribers have to manage their queues themselves.
Complexities of Asynchronous Architectures
"For example, when considering long-running async request/response, we have to think about what to do when the response comes back. Does it come back to the same node that initiated the request? If so, what if that node is down? If not, do I need to store information somewhere so I can react accordingly?"
How many times can subscribers attempt to process a message before they should all give up?
How to manage your error queue?
What if you have multiple versions of the same message type active at once?
Services as State Machines
"Our customer microservice owns all logic associated with behavior in this context. When a consumer wants to change a customer, it sends an appropriate request to the customer service...Our customer service controls all lifecycle events associated with the customer itself." (See Encapsulation)
"We want to avoid dumb, anemic services that are little more than CRUD wrappers."
Reactive Extensions (Rx)
"Mechanisms to compose the results of multiple calls together and run operations on them."
Works with asynchronous communication.
"At its heart, Rx inverts traditional flows. Rather than asking for some data, then performing operations on it, you observe the outcome of an operation (or set of operations) and react when something changes."
"They allow us to abstract out the details of how calls are made, and reason about things more easily."
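A toy illustration of the inversion (not a real Rx library, just the shape of the idea): rather than asking for a value, we subscribe to a stream of results and react as each one arrives, composing transformations along the way.

```python
class Observable:
    def __init__(self):
        self._observers = []

    def subscribe(self, on_next):
        self._observers.append(on_next)

    def emit(self, value):
        for on_next in self._observers:
            on_next(value)

    def map(self, fn):
        # Compose: a new observable of transformed values.
        out = Observable()
        self.subscribe(lambda v: out.emit(fn(v)))
        return out

prices = Observable()
seen = []
prices.map(lambda p: p * 2).subscribe(seen.append)
prices.emit(10)   # reacts as values arrive, e.g. from async service calls
prices.emit(21)
```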
DRY and the perils of code reuse in a microservice world
(see Martin about Single Responsibility meaning "responsible to a single individual/business unit" - one reason to change)
"DRY more accurately means that we want to avoid duplicating our system behavior and knowledge."
"When you want to change behavior, and that behavior is duplicated in many parts of your system, it is easy to forget everywhere you need to make a change, which can lead to bugs."
Shared libraries can be deceptively dangerous in microservices. It is tight coupling.
"My general rule: don't violate DRY within a microservice, but be relaxed about violating DRY across all services."
"I've spoken to more than one team who has insisted that creating client libraries for your services is an essential part of creating services in the first place."
A danger is that business logic will end up in the client library, when it should have stayed in the service. This can happen when the same people code the service and the client library.
"If the client library approach is something you're thinking about, it can be important to separate out client code to handle the underlying transport protocol, which can deal with things like service discovery and failure, from things related to the destination service itself."
Access By Reference
"We need to embrace the idea that a microservice will encompass the lifecycle of our core domain entities, like the Customer...We should consider the customer service as being the source of truth for Customers."
Consider how long you can hold cached data before you need to refresh it.
"Whether you decide to pass around a memory [cache] of what an entity once looked like, make sure you also include a reference to the original resource so that the new state can be retrieved."
Defer It As Long As Possible
"The best way to reduce the impact of making breaking changes is to avoid making them in the first place."
"REST helps because changes to internal implementation detail are less likely to result in a change to the service interface." - Again, nothing in REST does this, it is still down to separating your API contracts from your internal models.
"Another key to deferring a breaking change is to encourage good behavior in your clients, and avoid them binding too tightly to your services in the first place."
Example: the client should only deserialize the fields they need, not everything by default.
Example: a client that can find a field, even when it moves around the data structure, will break less often.
Postel's Law (aka Robustness Principle): Be conservative in what you do, be liberal in what you accept from others.
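A tolerant reader along those lines can be sketched as a search for just the field the client needs, wherever it sits in the document, rather than binding to the full structure:

```python
def find_field(document, name):
    """Depth-first search for the first occurrence of a field."""
    if isinstance(document, dict):
        if name in document:
            return document[name]
        for value in document.values():
            found = find_field(value, name)
            if found is not None:
                return found
    elif isinstance(document, list):
        for item in document:
            found = find_field(item, name)
            if found is not None:
                return found
    return None

# A v1 response, and a v2 response that moved the field deeper -
# the same reader handles both, so the client breaks less often.
v1 = {"customer": {"email": "a@example.com"}}
v2 = {"customer": {"contact": {"email": "a@example.com"}}}
```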
Catching Breaking Changes Early
see Consumer-Driven Contracts
test early and often
see Semantic Versioning (Major.Minor.BugFix)
Coexist Different Endpoints
If you add a side-by-side new endpoint, then clients can gradually shift from the old to the new. Then you can remove the old one.
Expand and Contract Pattern: we expand the capabilities we offer, then contract once all clients are off the old version
Use Multiple Concurrent Service Versions
Run two versions of the service side-by-side and manage the traffic with routing/proxy/whatever
Downside - you have to support two codebases at once
Terminal vs Web vs Mobile
consider data usage and battery usage
consider how to provide different UIs to Web and Mobile
UI fragment composition
Backends for frontends
Integrating with 3rd party software
most companies need software that they do not have time to write and maintain themselves
"Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn't that special."
you will have little or no control of how 3rd party systems are written, supported, updated, or operated
you will have little ability to customize 3rd party tools
it's often a good idea to place a microservice between the 3rd party tool and the rest of your system, to handle all integration
Strangler Pattern: intercept messages to old system, decide if you want to re-route them to new system
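The interception step can be sketched as a routing layer that checks whether a feature has been migrated yet (the path prefixes are assumptions for illustration):

```python
# Strangler pattern sketch: features already moved to the new system are
# routed there; everything else still goes to the old monolith.
MIGRATED_PREFIXES = ("/orders",)

def route(path):
    if path.startswith(MIGRATED_PREFIXES):
        return "new-system"
    return "old-monolith"
```

As more features are migrated, prefixes move into `MIGRATED_PREFIXES` until the monolith receives no traffic at all.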
Ch 5 Splitting the Monolith ================================
"How do we handle the fact that we may already have a large number of codebases lying about that don't follow these patterns? How do we go about decomposing these monolithic applications without having to embark on a big-bang rewrite?"
cohesion: keep things together that tend to change together
seam: a portion of the code that can be treated in isolation and worked on without impacting the rest of the codebase
"We want to identify seams that can become service boundaries."
- bounded contexts make excellent seams
- most programming languages provide namespace concepts that allow us to group similar code together - can indicate a seam
"We should identify the high-level bounded contexts that we think exist in our organization...Then we want to try to understand what bounded contexts the monolith maps to."
- create packages representing these contexts
- move the existing code into them
- left over code, that doesn't fit any context, will indicate more seams
- you can do this to a small portion of the system, it does not have to be all at once
"Our code should represent our organization, so our packages representing the bounded contexts in our organization should interact in the same way the real-life organizational groups in our domain interact."
- ex: the warehouse code should have no dependency on the finance department code
"I would strongly advise you to chop away at these systems. An incremental approach will help you learn about microservices as you go, and will also limit the impact of getting something wrong."
- where to start, once you have your seams?
- consider which seam will give you the most benefit when it is split into a microservice
- pace of change: you know that Inventory Management will be changing rapidly soon
- team structure: you have teams in different time zones - giving them different codebases to work on will help them be independent
- security: one bounded context needs tighter security than the rest
- technology: a team wants to write special algorithms in a specific language for one bounded context
- tangled dependencies: how much unwinding will it take to move this bounded context out of the monolith?
consider how entangled the different parts of your monolith are with one shared database
find seams in the database as you find seams in the codebase
you can start by dividing the database access code into different packages based on the bounded contexts
- watch out for foreign key constraints that cross boundaries
splitting the database will result in more database calls being made, since you can no longer get all your data with a single JOIN
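A sketch of that shift, with service names and shapes assumed for illustration: a query that was one JOIN in the monolith becomes one call per service, joined in application code.

```python
def catalog_get_item(item_id):            # stands in for the catalog service
    return {"id": item_id, "name": "Widget"}

def finance_get_price(item_id):           # stands in for the finance service
    return {"item_id": item_id, "price": 9.99}

def item_with_price(item_id):
    # Previously: SELECT ... FROM item JOIN price ON ... - one round trip.
    # Now: two network calls whose results we merge ourselves.
    item = catalog_get_item(item_id)
    price = finance_get_price(item_id)
    return {**item, "price": price["price"]}
```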
"Typically concerns around performance are now raised. I have a fairly easy answer to this: how fast does your system need to be? And how fast is it now? If you can test its current performance and know what good performance looks like, then you should feel confident in making a change."
"Also, we end up breaking transactional integrity when we move to two schemas..." This will take more thought about the design of the workflows.
Shared Static Data
recommendation to not store static (unchanging or very very rarely changing) data in a database - put it in a resource file instead
- faster lookup times
- self-documents that the data is stable
- put the shared data behind its own service
- copy the shared data into each service that will need it
Both of these options are usually overkill
Shared Mutable Data
this shared data probably needs its own service
or the table is actually two distinct tables glued together, and it can be split with one piece going to each client service, with a foreign key between them
Staging the Break
- found the seams and separated code into different packages, still in the same monolith
- found the database seams and separated those out, too
- now you can pull one piece of code out into a microservice
transaction: these events either all happen together, or none of them happen
- very useful for keeping data in a consistent state through complex changes
what to do after dividing an operation into multiple transactions?
- Try again later: retry each step of the operation until each has succeeded (see Eventual Consistency)
- Abort entire operation: unwind the transactions that already succeeded (see Compensating Transaction)
- Distributed transactions: a way to NOT split your transaction - can manage multiple normal transactions within one overarching transaction that can cross system boundaries
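The "abort entire operation" option above can be sketched as a list of steps, each paired with a compensating action that is run, in reverse order, if a later step fails (all names illustrative):

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs. Returns True on success."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):  # unwind in reverse order
                undo()
            return False
    return True

log = []

def failing_credit():
    raise RuntimeError("credit failed")

ok = run_saga([
    (lambda: log.append("debit account"), lambda: log.append("refund account")),
    (failing_credit,                      lambda: log.append("undo credit")),
])
# the first step ran and was compensated; the failed step was not compensated
```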
Two-Phase Commit: first the voting phase, where each participant (or cohort) says whether it thinks its local transaction can complete. If all answer yes, they are all told to commit; if any say no, they are all told to roll back.
- vulnerable to outages, where the process is interrupted in the middle, or a cohort cannot respond
- what if a commit fails after the voting phase?
- pending transactions will hold locks on resources, which can lead to contention - consider how this affects scaling of systems
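A toy illustration of the voting phase only (this deliberately omits the hard parts the notes above warn about: timeouts, crashes mid-protocol, and recovery):

```python
def two_phase_commit(cohorts):
    # Phase 1 (voting): every cohort says whether it can commit.
    votes = [cohort.prepare() for cohort in cohorts]
    # Phase 2: commit only if all voted yes, otherwise roll back all.
    if all(votes):
        for cohort in cohorts:
            cohort.commit()
        return "committed"
    for cohort in cohorts:
        cohort.rollback()
    return "rolled back"

class Cohort:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "pending"
    def prepare(self):
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled back"

a, b = Cohort(True), Cohort(False)
outcome = two_phase_commit([a, b])   # one "no" vote rolls everything back
```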
"The various algorithms are hard to get right, so I'd suggest you avoid trying to create your own...Do lots of research on this topic if this seems like the route you want to take..."
consider the cost of data cleanup vs the cost of catching every edge case
"All of these solutions add complexity...Distributed transactions are hard to get right and can actually inhibit scaling...Systems that eventually converge through compensating retry logic can be harder to reason about, and may need other compensating behavior to fix up inconsistencies in data."
"If you do encounter state that really, really wants to be kept consistent, do everything you can to avoid splitting it up in the first place. Try really hard."
"If you really need to go ahead with the split, think about moving from a purely technical view of the process and actually create a concrete concept to represent the transaction itself." (see Making Implicit Domain Models Explicit)
Reporting with one data store
Even when all the data is already stored in one database (monolithic system), for performance issues, the reporting database is often separate.
Consider - should the main database and reporting database share the same schema? It makes it easy to mirror data, but you lose the efficiency of designing the reporting database for reporting use cases.
Reporting with many data stores
"In splitting a service into smaller parts, we need to also potentially split up how and where data is stored...The audience of our reporting systems are users like any other, and we need to consider their needs."
You can pull the data together using API calls. This works only for small queries.
Caching data in the reporting database has the usual dangers - the data may be out of date. If you cache on the service side, consider that "the nature of reporting is often that we access the long tail of data. This means that we may well request resources that no one else has requested before, resulting in a potentially expensive cache miss."
"Reporting systems also often rely on third-party tools that expect to retrieve data in a certain way, and here providing a SQL interface is the fastest way to ensure your reporting tool chain is as easy to integrate with as possible."
You may need to write APIs just for reporting purposes. (See Batch APIs) Also consider asynchronous batching: request a bunch of data, ping the system until your export is ready, pick up the file.
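The asynchronous batch flow can be sketched as request, poll until ready, fetch. The service and its endpoints here are stand-ins, not a real API:

```python
class FakeReportingService:
    """Stand-in for a batch-export API."""
    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready
    def request_export(self):
        return "job-1"                      # batch-request id
    def status(self, job_id):
        self._remaining -= 1
        return "ready" if self._remaining <= 0 else "pending"
    def fetch(self, job_id):
        return b"order_id,total\n1,9.99\n"  # the prepared export file

def export_report(service, max_polls=10):
    job_id = service.request_export()
    for _ in range(max_polls):              # "ping the system until ready"
        if service.status(job_id) == "ready":
            return service.fetch(job_id)
    raise TimeoutError("export never became ready")

data = export_report(FakeReportingService(polls_until_ready=3))
```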
Each service pushes data updates to reporting.
Or "a standalone program that directly accesses the database of the service that is the source of data, and pumps it into a reporting database...We try to reduce the problems with coupling to the service's schema by having the same team that manages the service also manage the pump."
Event Data Pump
You could update reporting with event listeners. Danger here in coupling separate services.
Maybe you don't need all that data in one place, anyway. Consider distributed reporting.
"Small, incremental changes...allow us to better mitigate the cost of mistakes, but doesn't remove the chance of mistakes entirely."
- small cost: moving code within a codebase
- large cost: splitting a database
- large cost: rolling back a database change
Recommendation - think about and draw out your designs before you start. Consider the principles of object-oriented design, but at the service level.
It's ok for a service to grow so large it needs to be split. That's a cycle. "The key is knowing it needs to be split before the split becomes too expensive."
Cycle Time: the time it takes to get one change through development all the way to production
Continuous Integration (CI)
every time code is checked in, the server compiles it and runs tests to verify it is in a valid state
this requires that you do have a suite of automated regression tests
"The core goal is to keep everyone in sync with each other."
if any artifacts are created, they are only created once per version of the code - "This is to avoid doing the same thing over and over again, and so that we can confirm that the artifact we deployed is the one we tested."
the more frequently you merge changes together, the easier those merges will be
Microservices Plus CI
1) you can have 1 repo with all the different services in it
(con) any checkin causes all the services to rebuild
this can take awhile - especially running all the regression tests
it is unclear how many of these services now need to be deployed
(pro) you can checkin changes to multiple services at one time
(con) if you break the build, the whole pipeline is dead
this is not a good long-term solution, but it can get a new project started
2) you can have 1 repo with all the different services in it PLUS configure the CI tool to recognize different folders as different services
fixes several problems with the last approach
"On the one hand, my check-in/check-out process can be simpler as I have only one repository to worry about. On the other hand, it becomes very easy to get into the habit of checking in source code for multiple services at once, which can make it equally easy to slip into making changes that couple services together."
3) 1 repo per service, with one CI build per repo
Build Pipelines and Continuous Delivery
you can specify what order to run build steps in
ex: run the fast tests first, so if they fail we don't have to run the slow ones at all
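That ordering can be sketched as pipeline stages that run in sequence and stop at the first failure, so a failing fast stage never costs a slow run (stage names are illustrative):

```python
def run_pipeline(stages):
    """stages: ordered list of (name, zero-arg callable returning bool)."""
    for name, stage in stages:
        if not stage():
            return f"failed at: {name}"
    return "passed"

executed = []
def stage(name, passes):
    def run():
        executed.append(name)   # record which stages actually ran
        return passes
    return run

result = run_pipeline([
    ("fast unit tests",    stage("fast unit tests", True)),
    ("slow service tests", stage("slow service tests", False)),
    ("end-to-end tests",   stage("end-to-end tests", True)),  # never reached
])
```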
"This build pipeline concept gives us a nice way of tracking the progress of our software as it clears each stage, helping give us insight into the quality of our software."
build pipelines can end with an automatic production deployment, after all the automated tests pass
some CD tools can handle manual steps, where the process waits until a user verifies that the step is complete (such as UAT user acceptance testing)
Continuous Delivery (CD)
"The approach whereby we get constant feedback on the production readiness of each and every check-in, and furthermore treat each and every check-in as a release candidate."
Artifacts and Environment
CD may not just move an artifact (like a JAR file) along the pipeline. it might need to setup the environment as well. see automated configuration management tools.
"One way to avoid the problems associated with technology-specific artifacts is to create artifacts that are native to the underlying operating system."
pros: the OS can handle a lot for you: installation, event logging, dependency installation/resolution
cons: it can be hard to publish to a format the OS understands
cons: extra difficult if you are deploying to multiple OS's
if you automatically setup environments as needed
cons: it can take a while to set up a fresh environment - you can save time by only installing the software that is missing, but then you risk configuration drift, where environments slowly move out of sync
a next step option is to deploy an environment image, that already has everything setup on it
a virtual machine image
pros: always start with a fresh, up-to-date installation
pros: much faster to spin up an environment
pros: some configuration tools support a single configuration that can create images for multiple OS's
cons: building the initial image takes a long time
cons: the image can take up a lot of memory (or bandwidth when transferring on the network)
extension: you can build an image with your software already installed on it - this is a single deployment artifact
mutable server: you deploy from source control, then someone manually changes configs on the server
immutable server: no manual changes allowed - all changes must go through the build pipeline
build the complete image as an artifact - maybe even go to the extent of disabling SSH
Our test and prod environments can vary greatly. "For example, our production environment for our service might consist of multiple load-balanced hosts spread across two data centers, whereas our test environment might just have everything running on a single host."
cons: situations commonly arise in production that are not being tested in test
"As you move from your laptop to build server to UAT environments all the way to production, you'll want to ensure that your environments are more and more production-like to catch any problems associated with these environment differences sooner."
"Sometimes the cost to reproduce production-like environments can be prohibitive, so you have to make compromises." This cost could be for third-part licenses, or it could be the deployment time slowing down your manual test loop.
configuration should be limited to things that change from one environment to another
try not to use configuration to drastically change functionality
how to handle this with CD?
1) create one artifact per environment, with the configuration included
- this breaks the pipeline, because the image tested is not the image deployed to production
- it takes more time to build all these artifacts
- how to build secrets into the images without having the secrets checked into source control?
2) manage configuration outside of the deployment images
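Option 2 can be sketched with environment variables as the external configuration source (the variable names and defaults are assumptions): the artifact is identical everywhere, and only the injected values differ.

```python
import os

def load_config(environ=os.environ):
    return {
        # limit config to things that genuinely differ per environment
        "database_url": environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }

# Same artifact, different injected environments:
dev_config = load_config(environ={})
prod_config = load_config(environ={"DATABASE_URL": "postgres://prod-db/app",
                                   "LOG_LEVEL": "WARNING"})
```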
How many services per host?
"host" means "a generic unit of isolation - namely, an operating system onto which I can install and run my services"
1) multiple services per host
- easier host management, because fewer hosts
- cheaper, because fewer hosts (even if they are virtual, this is cheaper)
- makes monitoring more difficult (ex: how to know which service is using the CPU?)
- services can affect each other (ex: using up OS resources)
- deployments can be harder (ex: services with different contradictory dependencies)
- limits deployment artifact strategy options (ex: cannot deploy a full image to the machine to update one service)
2) application containers
ex: IIS for .Net services
the container handles a lot of the issues mentioned in option (1)
the trade off is that all your services have to use the same technology (.Net in this example)
monitoring individual services can still be very difficult
3) single service per host
much more practical these days, when it is easy to spin up virtual environments on demand
strongly recommended that you not use microservices if you cannot deploy one service per host
PaaS (Platform as a Service)
"Most of these platforms rely on taking a technology-specific artifact, such as a Java WAR file or Ruby gem, and automatically provisioning and running it for you...At the time of writing, most of the best, most polished PaaS solutions are hosted...It doesn't just handle running your service, it also supports services like databases in a very simple fashion."
when they work, things are good
when they don't work, you have limited options for getting into the system to fix it
you can handle a small number of machines manually, but you need automation to handle many microservices hosted in many environments and on many hosts per environment
"Picking technology that enables automation is highly important. This starts with the tools used to manage hosts. Can you write a line of code to launch a virtual machine, or shut one down? Can you deploy the software you have written automatically?"
provides a virtual cloud on your laptop, so you can run all the services at the same time easily
useful for dev and test
can be resource intensive
Docker apps are like Virtual Machine images
"Whatever underlying platform or artifacts you use, having a uniform interface to deploy a given service is vital. We'll want to trigger deployment of a microservice on demand in a variety of different situations, from deployments locally for dev and test to production deployment...We'll also want to keep our deployment mechanisms as similar as possible from dev to production, as the last thing we want is to find ourselves hitting problems in production because deployment uses a completely different process."
"...the most sensible way to trigger any deployment is via a single, parameterizable command-line call."
Types of Tests
from "Agile Testing" by Crispin and Gregory
Unit Testing: automated: did we build it right?: supports programming and is technology facing
Acceptance Testing: automated: did we build the right thing?: supports programming and is business facing
Exploratory Testing: manual: how can I break the system?: critiques the product and is business facing
Property Testing: tools: response time, scalability, performance, security...: critiques the product and is technology facing
recommends automating as much manual testing as possible, to support fast, frequent, and consistent testing
Test Pyramid from "Succeeding With Agile" by Cohn
unit tests: lots and lots of them: isolates errors, fast for devs to run as they develop
service tests (api, integration): some of them
UI tests (end-to-end): a few of them: verifies broad user functionality is working, gives confidence in the whole system
"typically test a single function or method call"
"on modern hardware you could expect to run many thousands of these in less than a minute"
"the primary goal of these tests is to give us very fast feedback about whether our functionality is good"
should catch most of your bugs
supports refactoring by verifying that all existing functionality still works the same
Service Tests (has many different names)
test the services without using the UI
test one service at a time - isolate where bugs are located
they are best when you can isolate them from the rest of the system by stubbing out their dependencies
"Our service test suite needs to launch stub services for any downstream collaborators, and configure the service under test to connect to the stub services."
End To End Tests
test the whole system together, including the UI
when these pass, you have a lot of confidence in the whole system
run a lot slower, so you can't test every possible path
"When broader-scoped tests like our service or end-to-end tests fail, we will try to write a fast unit test to catch the problem in the future."
large-scope tests are often flaky or brittle - they can require a lot of maintenance
if a test suite is slow, people won't run it often
this can affect your entire development cycle - if committing is slow because of the tests it triggers, people will make fewer, larger commits - you don't want that
support fast feedback cycles by keeping all test suites running in a satisfactory amount of time
"A test suite that takes all day and often has breakages that have nothing to do with broken functionality are a disaster."
you must actively remove tests that are no longer needed
"It is hard to have a balanced conversation about the value something adds versus the burden it entails...Do you get blamed if you remove a tests? Maybe. But you'll certainly get blamed if a test you removed lets a bug through...[But] if the same feature is covered in 20 different tests, perhaps we can get rid of half of them,as those 20 tests take 10 minutes to run."
stub service: a service that responds with canned responses
see "Growing Object-Oriented Software, Guided by Tests" by Freeman and Pryce
see Mountebank - a project that stubs out services. "When it launches, you send it commands telling it what port to stub on, what protocol to handle, and what responses it should send when requests are sent."
Flaky or Brittle Tests
flaky: sometimes pass, sometimes fail, with no discernible difference between test runs
brittle: broken easily and often due to small code changes
when a dev knows or thinks a test is flaky, they'll keep running it instead of digging into what the problem might be; or they'll move forward in spite of the failing tests
"Martin Fowler advocates the approach that if you have flaky tests, you should track them down and if you can't immediately fix them, remove them from the suite so you can treat them...See if you can rewrite them to avoid testing code running on multiple threads...etc"
Who Writes Tests?
if one team owns the code under test, they should write the tests
if multiple teams own the code under test (likely for end-to-end-tests)
(1) don't just let everyone write tests - there will be missing tests, there will be duplicated tests, it will be messy
(2) don't just have a test team write the tests - the devs need to be involved in the test process, and to not become antagonistic to the people who are finding bugs; and devs should not sit idle waiting for someone else to write tests for their code
The Great Pile-Up
end-to-end tests are slow > something is broken > the pipeline halts until the problem is fixed > more changes have piled up waiting for end-to-end tests > it is more likely that another problem will turn up, and with all the changes, it will be harder to find the bug > loop
"A key driver to ensuring we can release our software frequently is based on the idea that we release small changes as soon as they are ready."
this discusses deploying multiple services' changes at the same time
"By versioning together changes made to multiple services, we effectively embrace the idea that changing and deploying multiple services as once is acceptable...In doing so, we cede one of the main advantages of microservices: the ability to deploy one service by itself, independently of other services...All too often, the approach of accepting multiple services being deployed together drifts into a situation where services become coupled."
Test Journeys, Not Stories
"[It is a trap] to add a new end-to-end test for every piece of functionality we add...The best way to counter this is to focus on a small number of core journeys to test for the whole system. Any functionality not in these core journeys needs to be covered in tests that analyze services in isolation from each other."
"These journeys need to be mutually agreed upon, and jointly owned...high-value interactions and very few in number."
"[few in number]: very low double digits even for complex systems"
Consumer Driven Contract
"We are defining the expectations of a consumer on a service...captured in code form as tests...If done right, these CDCs should be run as part of the CI build of the service/producer, ensuring it never gets deployed if it breaks one of these contracts."
these are API tests - run against a single service in isolation
"A good practice here is to have someone from the producer and consumer teams collaborate on creating the tests."
"[The CDCs] become the codification of a set of discussions about what a service API should look like...when they break, they become a trigger point to have conversations about how that API should evolve."
End to End Tests
"From speaking to people who have been implementing microservices at scale for a while now, I have learned that most of them over time remove the need entirely for end-to-end tests in favor of tools like CDCs and improved monitoring. But they do not necessarily throw those tests away. They end up using many of those end-to-end journey tests to monitor the production system using a technique called Semantic Monitoring."
"You can view running end-to-end tests prior to production deployment as training wheels...a useful safety net, where you are trading off cycle time for decreased risk."
Separating Deployment from Release
Regarding adding more and more tests to catch more cases: "At a certain point we have to accept that we hit diminishing returns with this approach. With testing prior to deployment, we cannot reduce the chance of failure to zero."
An alternative to testing before deployment "If we can deploy our software, and test it in situ prior to directing production loads against it, we can detect issues specific to a given environment...A common example of this is the Smoke Test Suite, a collection of tests designed to be run against newly deployed software to confirm that the deployment worked. These tests help you pick up any local environmental issues."
blue/green deployment: "We have two copies of our software deployed at a time, but only one version of it is receiving real requests." The new version is deployed, smoke tests are run against it, and then the production load is switched from the previous version to the new version. This also reduces production downtime, because you just change where the traffic stream is directed after the new services are already running.
canary releasing: direct a little production traffic to the new deployment. If the results are satisfactory, gradually increase the load until no traffic is going to the old deployment. Some teams send a copy of the production traffic to the new deployment, instead of fully redirecting it.
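A canary router can be sketched as a deterministic traffic split; the hash-based bucketing here is one possible approach (an assumption, not a prescription), chosen so a given user consistently lands on the same version:

```python
import hashlib

def choose_version(user_id, canary_fraction):
    """Route a stable fraction of users to the new version."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    bucket = digest[0] / 256          # stable value in [0, 1) per user
    return "new" if bucket < canary_fraction else "old"

# Gradually raising canary_fraction from 0.0 to 1.0 shifts all traffic over.
routed = [choose_version(uid, canary_fraction=0.1) for uid in range(1000)]
new_share = routed.count("new") / len(routed)
```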
"Using these approaches is a tacit acknowledgment that we cannot spot and catch all problems before we actually release our software."
Mean Time to Repair (MTTR) Over Mean Time Between Failures (MTBF)
"Sometimes expending the same effort into getting better at remediation of a release can be significantly more beneficial than adding more automated functional tests."
- could be as simple as good monitoring coupled with fast rollbacks
"Most organizations that I see spending time creating functional test suites often expend little to no effort at all on better monitoring or recovering from failure. So while they may reduce the number of defects that occur in the first place, they can't eliminate all of them, and are unprepared for dealing with them if they pop up in production."
definitely consider how costly a single error is to your particular business, some businesses are more forgiving than others
nonfunctional requirements: latency, load, accessibility (to the disabled), security, etc. Anything that cannot be implemented like a normal feature.
cross-functional requirements is another term for nonfunctional requirements "It speaks more to the fact that these system behaviors really only emerge as the result of lots of cross-cutting work."
"Many, if not most, CFRs can really only be met in production." But you can still write lower-level non-prod tests that are helpful here
"For example, once you've found a performance bottleneck in an end-to-end load test, write a smaller-scoped test to help you catch the problem in the future."
"When decomposing systems into smaller microservices, we increase the number of calls that will be made across network boundaries. Where previously an operation might have involved one database call, it may now involve three or four..."
"Tracking down sources of latency is especially important."
since these tests often involve simulating a lot of load on the system, and running multiple iterations, they can take a while to run
"It isn't always feasible to run them on every check-in. It is a common practice to run a subset every day, and a larger set every week."
Chapter 8 Monitoring ===============================
"The answer here is pretty straightforward: monitor the small things, and use aggregation to see the bigger picture."
Single Service, Single Server
"We'll need to monitor the host itself, CPU, memory...We'll want to know what they should be when things are healthy, so we can alert when they go out of bounds."
"We'll want to have access to the logs from the server itself...We can probably get by with just logging on to the host and using command line tools to scan the log."
"We might want to monitor the application itself. At a bare minimum, monitoring the response time of the service is a good idea."
Single Service, Multiple Servers
"We still want to monitor all the same things as before, but need to do so in such a way that we can isolate the problem."
"We still want to track the host-level metrics, and alert on them." But we also want to see them aggregated across all hosts.
We'll want all the logs available in one place.
We'll need to monitor the load balancer that sits in front of these multiple servers. And "we'll configure our load balancer to remove unhealthy nodes from our application."
Multiple Services, Multiple Servers
This requires "collection and central aggregation of as much as we can get our hands on, from logs to application metrics."
"It can be hard to know what 'good' looks like when we're looking at metrics for a more complex system...The secret to knowing when to panic and when to relax is to gather metrics about how your system behaves over a long-enough period of time that clear patterns emerge."
Aggregate per service, and aggregate for the whole system
some monitoring systems will save aggregates (or samples) of old data, so you don't fill up your storage with old, detailed data
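A minimal sketch of that downsampling idea (the function name and bucket scheme are illustrative, not from the book): recent samples stay at full resolution, older samples get collapsed into per-bucket averages.

```python
# Sketch: collapse detailed historical samples into coarser aggregates.
# Each bucket of `bucket_size` raw samples becomes a single average.
def downsample(samples, bucket_size):
    """samples: list of numbers; returns one average per bucket."""
    return [
        sum(samples[i:i + bucket_size]) / len(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]
```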
"Make sure you can get access to the raw data to provide your own reporting or dashboards if you need to."
"Another key benefit to understanding your trends is when it comes to capacity planning. Are we reaching our limit? How long until we need more hosts?"
"I would strongly recommend having your services expose basic metrics themselves. At a bare minimum, for a web service you should probably expose metrics like response times and error rates - vital if your server isn't fronted by a web server that is doing this for you."
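A sketch of what "expose basic metrics themselves" could look like in process (class and field names are my assumptions, not the book's): track response times and error rates per request, and surface a snapshot that a `/metrics` endpoint could return.

```python
# Sketch: a service tracking its own response times and error rate.
class ServiceMetrics:
    def __init__(self):
        self.response_times = []  # seconds per request
        self.error_count = 0
        self.request_count = 0

    def record(self, duration_seconds, is_error=False):
        self.request_count += 1
        self.response_times.append(duration_seconds)
        if is_error:
            self.error_count += 1

    def snapshot(self):
        """What we'd expose on a metrics endpoint."""
        avg = (sum(self.response_times) / len(self.response_times)
               if self.response_times else 0.0)
        return {
            "request_count": self.request_count,
            "avg_response_time_s": avg,
            "error_rate": (self.error_count / self.request_count
                           if self.request_count else 0.0),
        }
```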
"For example, our accounts service may want to expose the number of times customers view their past orders, or your web shop might want to capture how much money has been made during the last day."
"There is an old adage that 80% of software features are never used...Wouldn't it be nice to know what they are?"
Knowing how customers use your system, you can improve it where it matters.
"We can never know what data will be useful...I tend to err toward exposing everything and relying on my metrics system to handle this later."
Synthetic Monitoring aka Semantic Monitoring
"What we actually want to track...is the system working? The more complex the interactions between the services, the further removed we are from actually answering that question."
"So what if our monitoring systems were programmed to act a bit like our users and could report back if something goes wrong?"
example from a banking/stocks system: generate fake requests and monitor how the system handles them. If the response time is too slow, report that as an error.
- this is a Synthetic Transaction
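A hedged sketch of a Synthetic Transaction runner: issue a fake call against the system and flag an error if it fails or takes longer than a threshold. The `call` parameter stands in for whatever real request your monitor would make.

```python
import time

# Sketch: run one synthetic transaction and report success/failure.
def run_synthetic_transaction(call, threshold_seconds=1.0):
    start = time.monotonic()
    try:
        call()  # the fake request against the real system
    except Exception as exc:
        return {"ok": False, "reason": f"call failed: {exc}"}
    elapsed = time.monotonic() - start
    if elapsed > threshold_seconds:
        return {"ok": False, "reason": f"too slow: {elapsed:.2f}s"}
    return {"ok": True, "reason": None}
```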
Given one call to your system that generates many downstream effects, how do you correlate all those log messages together?
Generate a GUID when a call first enters your system, and pass it to all downstream calls.
Correlation IDs are easier to add at the beginning of a design than to insert later.
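The mechanics can be sketched in a few lines (the header name is an assumption; the point is to standardize one across all services): reuse the caller's correlation ID if present, otherwise mint a new GUID, and attach it to every downstream call.

```python
import uuid

# Assumed header name; standardize one across your services.
CORRELATION_HEADER = "X-Correlation-Id"

def ensure_correlation_id(incoming_headers):
    """Reuse the caller's correlation ID if present, otherwise mint one."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def outgoing_headers(correlation_id):
    """Headers to attach to every downstream call we make."""
    return {CORRELATION_HEADER: correlation_id}
```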
"Imagine a situation where the network connection between our music shop website and the catalog service goes down. The services themselves appear healthy, but they can't talk to each other. If we just looked at the health of the individual service, we wouldn't know there is a problem."
"Therefore, monitoring the integration points between systems is key."
"Each service instance should track and expose the health of its downstream dependencies, from the database to other collaborating services."
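A sketch of such a per-instance health report (the shape of the output and the check functions are illustrative assumptions): each dependency gets a zero-arg check, and the instance reports itself degraded if any dependency is down.

```python
# Sketch: a service instance reporting the health of its downstream
# dependencies (database, collaborating services).
def health_report(checks):
    """checks: mapping of dependency name -> zero-arg callable returning bool."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "up" if check() else "down"
        except Exception:
            results[name] = "down"  # a failing check counts as down
    overall = "healthy" if all(v == "up" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}
```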
"One of the ongoing balancing acts you'll need to pull off is where to allow for decisions to be made narrowly for a single service versus where you need to standardize across your system...In my opinion, monitoring is one area where standardization is incredibly important."
"You should try to write your logs out in a standard format." Ex: using the same name for the same metric across systems.
"The key is making it easy to do the right thing."
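One way to make the standard format easy to do (field names here are illustrative, not prescribed by the book): emit one JSON object per log line with the same field names in every service, so aggregation tooling can query them uniformly.

```python
import json
from datetime import datetime, timezone

# Sketch: a standard one-JSON-object-per-line log format shared by all services.
def log_line(service, level, message, **fields):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
    }
    record.update(fields)  # e.g. correlation_id, request path
    return json.dumps(record, sort_keys=True)
```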
Consider the Audience
"All this data we are gathering is for a purpose. More specifically, we are gathering all this data for different people to help them do their jobs; this data becomes a call to action."
"Consider the following: what do they need to know right now? what might they want to know later? how will they consume the data?...Alert on the things they need to know right now...Give them easy access to the data they need to know later...And spend time with them to know how they want to consume data."
"Historically, the idea that we can find out about key business metrics a day or two later was fine, as typically we were unable to react fast enough to this data to do anything about it anyway. Now, though, we operate in a world in which many of us can and do push out multiple releases per day."
"In such an environment, we need all our metrics at our fingertips to take the right action."
Chapter 9: Security ===========================================
"We need to think about what protection our data needs while in transit from one point to another, and what protection it needs at rest."
Authentication and Authorization
"Authentication is the process by which we confirm that a party is who she says she is." For example, by having them enter their username and password.
Principal: who or what has been authenticated
"Authorization is the mechanism by which we map from a principal to the action we are allowing her to do."
"When it comes to distributed systems...we don't want everyone to have to log in separately for different systems, using a different username and password for each....The aim is to have a single identity that we can authenticate once."
Common Single Sign-On (SSO) Implementations
ex: SAML, OpenID
"When a principal tries to access a resource, she is directed to authenticate with an Identity Provider...Once the Identity Provider is satisfied that the principal has been authenticated, it gives information to the Service Provider, allowing it to decide whether to grant her access to the resource."
"This Identity Provider could be an externally hosted system..."
Okta is an Identity Provider that uses SAML
Single Sign-On Gateway
"Rather than having each service manage handshaking with your Identity Provider, you can use a Gateway to act as a proxy, sitting between your service and the outside world. The idea is that we can centralize the behavior for redirecting the user and perform the handshake in only one place."
"We still need to solve the problem of how the downstream service receives information about principals..."
ex: add principal info to HTTP headers
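A sketch of that header-passing approach (the header names are my assumptions, not a standard): the gateway injects principal information after the SSO handshake, and each downstream service parses it back out.

```python
# Sketch: gateway forwards principal info as HTTP headers;
# downstream services recover it. Header names are assumptions.
def principal_headers(username, roles):
    return {
        "X-Principal-User": username,
        "X-Principal-Roles": ",".join(roles),
    }

def parse_principal(headers):
    """What a downstream service would do to recover the principal."""
    raw_roles = headers["X-Principal-Roles"]
    return {
        "username": headers["X-Principal-User"],
        "roles": raw_roles.split(",") if raw_roles else [],
    }
```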
"Be careful...Gateway layers tend to take on more and more functionality, which itself can end up being a giant coupling point."
once a user is authenticated, what functionality can they actually access?
"These decisions need to be local to the microservice in question."
ex: the user is allowed to issue refunds, but only up to $200
"Favor coarse-grained roles, modeled around how your organization works. Going all the way back to the early chapters, remember that we are building software to match how our organization works."
Service-to-Service Authentication and Authorization
a principal can be a service, rather than a user
1) "Allow everything inside the perimeter: just assume that any calls to a service made from inside our perimeter are implicitly trusted."
- "Should an attacker penetrate your network, you will have little protection against a typical man-in-the-middle attack."
- "I worry that the implicit trust model is not a conscious decision, but more that people are unaware of the risks in the first place."
2) "HTTP(S) Basic Authentication: allows for a client to send a username and password in a standard HTTP header...This is an extremely well understood and well supported protocol."
- should be done over HTTPS, so the login info cannot be read by outsiders
- the server will need to manage SSL Certificates
- encrypted traffic cannot be cached, so this limits your caching strategies
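For reference, the Basic Authentication header itself (per RFC 7617) is just the username and password joined with a colon and base64-encoded, which is why it must travel over HTTPS:

```python
import base64

# Sketch: construct an HTTP Basic Authentication header (RFC 7617).
def basic_auth_header(username, password):
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```

Note that base64 is an encoding, not encryption; without HTTPS anyone on the path can decode the credentials.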
3) "SAML or OpenID Connect: if you are already using [this] as your authentication and authorization scheme, you could just use that for service-to-service interactions, too."
- you'll need to carefully structure and manage user credentials so they can be revoked at a granular level
4) "Client Certificates: make use of TLS (Transport Layer Security)...Each client has an X.509 certificate installed that is used to establish a link between client and server. The server can verify the authenticity of the client certificate, providing strong guarantees that the client is valid."
- more and more certificates are hard to manage
- "With all the complexities around the certificates themselves, you can expect to spend a lot of time trying to diagnose why a service won't accept what you believe to be a completely valid certificate."
- "Consider the difficulty of revoking and reissuing certificates..."
5) "HMAC Over HTTP"
- an alternative to HTTPS: HMAC (hash-based message authentication code) is used to sign requests.
- "The request body along with a private key is hashed, and the resulting hash is sent along with the request. The server then uses its own copy of the private key and the request body to re-create the hash. If it matches, it allows the request."
- because the traffic is not encrypted, requests can be cached - but they travel in plain text where anyone can read them
- the private key is never revealed, but it does need to be shared between the client and server somehow
- hashing may have lower overhead than HTTPS
- "This is a pattern, not a standard, and thus there are divergent ways of implementing it...Be aware of the difficulty of getting this stuff right."
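The core signing mechanism described above can be sketched with the standard library (what goes into the hash and how the key is shared are exactly the divergent parts the warning is about):

```python
import hashlib
import hmac

# Sketch: HMAC request signing - hash the request body with a shared
# private key; the server recomputes the hash and compares.
def sign_request(shared_key: bytes, body: bytes) -> str:
    return hmac.new(shared_key, body, hashlib.sha256).hexdigest()

def verify_request(shared_key: bytes, body: bytes, signature: str) -> bool:
    expected = sign_request(shared_key, body)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature)
```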
6) "API Keys: allow a service to identify who is making a call, and place limits on what they can do."
- "Some [internal] systems use a single API Key that is shared, and use an approach similar to HMAC...A more common approach is to use a public and private key pair."
- "Typically, you'll manage keys centrally...The gateway model is very popular in this space."
- "Compared to handling a SAML handshake, API key-based authentication is much simpler and more straightforward."
The Deputy Problem
The client makes a request to one service. That service makes a request to a second service. Should the second service just trust calls from inside the system?
"There is a typical vulnerability called the Confused Deputy Problem, which in the context of service-to-service communication refers to a situation where a malicious party can trick a deputy service into making calls to a downstream service on his behalf that he shouldn't be able to...For example, as a customer, when I log in to the online shopping system, I can see my account details. What if I could trick the online shopping UI into making a request for someone else's details, by making a call with my logged-in credentials?"
1) implicit trust
2) check if the caller (by pre-authenticated identity) is allowed to make this request
3) pass the full login information along
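Option 2 can be sketched as a simple permission check on the already-authenticated caller's identity (the service names and actions are illustrative):

```python
# Sketch: the downstream service checks whether the authenticated calling
# service is allowed to perform this action. Table contents are assumptions.
ALLOWED = {
    ("online-shop-ui", "read-own-account"),
    ("order-service", "read-any-account"),
}

def authorize(caller, action):
    """caller: pre-authenticated service identity; action: requested operation."""
    return (caller, action) in ALLOWED
```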
Securing Data at Rest
It's important to secure persisted data as well.
"The easiest way you can mess up data encryption is to try to implement your own encryption algorithms, or even try to implement someone else's...Whatever programming language you use, you'll have access to reviewed, regularly patched implementations of well-regarded encryption algorithms. Use those!"
"Subscribe to the mailing lists/advisory lists for the technology you choose to make sure you are aware of vulnerabilities..."
Encrypt sensitive data
Salt and hash passwords
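A sketch of salting and hashing with a reviewed standard-library primitive (PBKDF2) rather than anything home-grown; the iteration count here is an assumption, so use your platform's current recommendation:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumption; follow current guidance for your platform

def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest); a fresh random salt per password."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def check_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)
```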
Where to store the encryption key? "One solution is to use a separate security appliance to encrypt and decrypt data. Another is to use a separate key vault that your service can access when it needs a key."
Consider what data can appear in log files
"Encrypt data when you first see it. Only decrypt on demand, and ensure that decrypted data is never stored anywhere."
Your backups need to be as secure as the main database. Make sure you know which encryption key applies to which backup.
Defense in Depth
Firewalls: network traffic restrictions, "looks outward to stop bad things from getting in"
Logging: "can help with detecting and recovering from [attacks]"
IDS: Intrusion Detection System: "monitor networks and hosts for suspicious behavior, reporting problems when it sees them...actively looks inside the perimeter for suspect behavior..."
Network Segregation: "you can put [microservices] into different network segments to further control how services talk to each other."
- "Start with only running services as OS users that have as few permissions as possible..."
- "Patch your software. Regularly. This needs to be automated..."
(Lengthy example of security concerns)