The first virtual Project Teams Gathering (PTG) was held June 1-5, spanning time zones around the globe and giving anyone the opportunity to join, including contributors to the OSF Edge Computing Group (ECG).
The group held three sessions during the first three days of the week, in which we spent a total of seven hours discussing topics relevant to edge computing use cases, technologies, and architectures. To take advantage of the virtual format, we also invited adjacent communities, such as CNTT, OPNFV and the Kubernetes edge groups, to participate.
We started the sessions with an introduction to the work the Edge Computing Group has been doing to define reference models and architectures that satisfy the requirements of most edge computing use cases and prepare for some common error cases. The main discussion point is the level of autonomy an edge site requires, which, among other things, affects the functionality available if the site loses network connectivity to the central data center. The two identified models are the Centralized Control Plane and the Distributed Control Plane.
The ECG has defined reference architectures to realize these models with OpenStack services and has also started testing activities to verify and validate functionality. The purpose of the sessions at the PTG was to gather feedback on the models, to improve the reference architectures by adding new components, and to discuss the options for running all types of workloads at the edge.
We touched on TripleO’s Distributed Compute Node (DCN) architecture, which is an example of the Centralized Control Plane model. Our discussions centered on the challenges of edge deployments, such as latency (“100ms is a killer for distributed systems over WAN”) and nodes getting out of sync, which can be a big issue. We also talked about improvements, such as Ceph being available since the OpenStack Train release, where previously only ephemeral storage was supported, and the increased number of edge nodes running compute services and workloads.
We spent more time discussing the Distributed Control Plane model, which was of interest to the CNTT community as well, so we went into more detail about ways to implement this option. Some of the meeting participants are already deploying OpenStack at edge sites, which requires shrinking the footprint to fit limited hardware resources, one of the common constraints of edge use cases. When all the controller services run at the edge, resource usage can be challenging, but it’s not an unsolved problem. Another popular option we discussed is the federated approach, which is supported by components such as the OpenStack Identity Service (Keystone).
As an example of the distributed model, we had a short discussion about StarlingX and some of the design decisions that have shaped the project’s architecture. StarlingX integrates well-known open source components such as OpenStack, Kubernetes and Ceph into one platform, along with services for software and hardware management that are developed by the community. During the PTG session, we discussed the Distributed Cloud feature in more detail to understand how StarlingX manages edge sites, which can operate with full autonomy during network failures while still being managed centrally. Discussion topics included what is synchronized and shared between the nodes to ensure smooth operation in different scenarios, as well as essential edge functionality such as zero touch provisioning.
StarlingX runs the majority of its platform services in containers and also makes it possible to have edge sites that run only container workloads. The mention of containers led the discussion towards better understanding the requirements for container orchestration tools such as Kubernetes in edge infrastructure. We talked a bit about concepts such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Container as a Service (CaaS), and how the lines between them have recently started to disappear. The focus was on requirements for Kubernetes from a CaaS viewpoint, while we also looked at how they impact the reference architectures. We need to understand storage and networking configurations, as well as the handling of data crucial to running containers, such as quotas and certificates.
During the PTG sessions the participants took the opportunity to talk about storage, an area the ECG hasn’t yet had the chance to look into. We concentrated on object storage this time, as block storage strategies are a bit more straightforward. We agreed that the primary use of object storage is to serve applications, but it is useful for the platform too, for example as a backend for sharing images through the OpenStack Image Management (Glance) service. Participants from the OpenStack Object Storage (Swift) team joined the meeting to help identify use cases and requirements for this component to take into account during design and development. The main use case we discussed was Content Delivery Networks (CDN), while online backup and gaming can also be considered. On the design side, we started to discuss architectural considerations and how factors such as latency affect the components of Swift.
As the PTG is a great opportunity to sync with project teams, we held joint sessions with the OpenStack Network Connectivity as a Service (Neutron) team and the OpenStack Accelerator (Cyborg) team. To cover cross-community topics as well, we had sessions with the Airship project and with KubeEdge from the CNCF community.
One of the ongoing discussions with Neutron is the handling of network segment ranges: making sure they are defined and handled properly in the distributed environment in both architectural models. A feature request for Neutron has already been approved and could get priority during the Victoria release cycle. The first step is to put together a proof of concept deployment to test the planned configurations and changes. The Neutron team and ECG contributors will work on testing as a joint effort. Another relevant project is Neutron Interconnection, which at this point consists mainly of API definitions and aims to provide interconnection between OpenStack deployments and regions over WAN. Other networking topics included Precision Time Protocol (PTP), which the StarlingX community is already working on, along with Time Sensitive Networking (TSN).
The next cross-project session was the sync with the Cyborg team. This session drew a lot of interest, as the ability to use hardware accelerators is crucial for many edge use cases. During the session we focused on learning about the project’s current activities, such as the implementation of the new v2 API and its next steps. We also touched on device driver integration. Cyborg concentrates on the ability to program the acceleration devices that are made available to applications, and will not include low-level device driver integration in the upstream code. The Cyborg team is working with the Neutron, OpenStack Placement and OpenStack Compute (Nova) teams to ensure smooth integration and full functionality in these areas.
During the sync sessions we also focused on related topics such as lifecycle management of the platform services. One of the main challenges is handling upgrades throughout the whole edge infrastructure, which can be a big headache in massively distributed systems. In some use cases downtime is almost unacceptable, which means the edge site needs enough resources to keep services running during an upgrade. When that is not feasible, we need to identify processes that minimize the time services are unavailable.
On this theme, we talked with the Airship community, as the project’s main mission is to address deployment and lifecycle management of software such as OpenStack and Kubernetes, and it can therefore be an option for addressing the challenges above. Airship is adopting more and more components from the CNCF landscape, as its design relies heavily on containers for flexibility. For in-place upgrades, Airship will use Cluster API and the concept of dual box deployment at edge sites, which ensures there is always a replica of each service to provide availability during the upgrade process.
Our last session was with the KubeEdge project, which focuses on using Kubernetes for edge computing use cases. It is built on top of Kubernetes with extensions such as application management, resource management and device management for IoT devices. Its main focus areas are IoT, CDN and Multi-access Edge Computing (MEC) scenarios. The project releases every three months and has an increasing focus on IoT protocols. Its architecture follows the Centralized Control Plane model, running only worker nodes at the edge. As we had limited time during the PTG, we agreed to continue the collaboration after the event and work together on providing better solutions for edge computing use cases.
After the PTG we have many action items to follow up on, both to evolve the architecture models and to share our learnings with projects and adjacent communities for use in their own architecture solutions. We will keep looking more closely at object storage use cases and requirements in order to add them to our reference models, and we are also working on setting up discussions with further Kubernetes groups, such as the Kubernetes Edge and IoT group. We will also follow up on relevant implementation work and start testing activities when the software components are ready.
As StarlingX is a project highly relevant to edge, you might also be interested in checking out the StarlingX virtual PTG summary to learn about the community’s current activities and plans for the upcoming releases.
If you are interested in participating in the OSF Edge Computing Group, sign up for our mailing list and join our weekly calls. You can find more information about the group’s activities on our wiki page. For a more cohesive summary of our work, read the new white paper the group just published!