Service's hash

proposal

#1

Related to: MESG Core v0.9 Release Notes

In this topic, I would like to go through the different issues we have with the hash of a service and find solutions that can be implemented step by step.

The ultimate goal is to have:

1. same hash across any computer
This feature is really important for the future decentralized network. The hash will be the unique identifier used to dispatch executions across multiple Cores. That's how the network will be able to know which service is running on which Core, and to select the right Core to execute and verify executions.

Requested by @krhubert:
6. same hash even with different deployment methods
The hash should be the same even if the service is deployed with a different method: local directory, git repo, or tar archive (including compressed ones).

But the hash should change when:

2. source code change
means a potentially completely different service.

3. service definition change
means a potentially completely different service.

4. env variables change
could drastically change the behavior of a service (e.g. Ethereum with its different networks: same source code, but the data is completely different, so the behavior of the service is also different).

5. dependencies change (OS version, apt-get, npm i, go mod, etc…)
this is more subtle but should also be taken into account: a different version of a dependency could also drastically change the behavior of the service.
For instance, a package manager using semver could break when installing a new version that actually breaks a dependency. (It should be fine when pinning versions or using a 'lock' file, e.g. package-lock.json.) As another example, running apt-get upgrade can also install different versions of system libraries.

Solutions

- Docker image

:x: #1, #4, #6
:white_check_mark: #2, #3, #5
Docker image hashes are completely non-deterministic and change any time a change of type #2, #3, or #5 occurs. Even worse, rebuilding from the same source without the Docker cache, or on another computer, creates a different image hash.

- Docker image + hash of env variables (current implementation)

:x: #1, #6
:white_check_mark: #2, #3, #4, #5
In this improved version of the previous solution, the env variables are taken into account, so it solves #4. But goal #1 is still not met.

- Checksum of source + hash of env variables

:x: #5
:white_check_mark: #1, #2, #3, #4, #6
PR #731 calculates the hash based on the source code, the env variables, and the service definition. It calculates the same hash across computers. BUT it doesn't take dependency changes into account.
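For illustration, here is a minimal Go sketch of this kind of calculation (my own sketch under assumptions, not the actual code of PR #731; the file names and the serviceHash helper are hypothetical). It feeds the source archive, the sorted env variables, and the service definition into a single SHA-256, so the result depends only on those inputs and is stable across computers:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"sort"
)

// serviceHash computes a deterministic hash from the service's source
// archive, its env variables, and its definition file. Sorting the env
// variables makes the result independent of map iteration order.
func serviceHash(sourceTar, definition string, env map[string]string) (string, error) {
	h := sha256.New()

	// 1. Hash the source archive bytes.
	src, err := os.Open(sourceTar)
	if err != nil {
		return "", err
	}
	defer src.Close()
	if _, err := io.Copy(h, src); err != nil {
		return "", err
	}

	// 2. Hash the env variables in a stable (sorted) order.
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, env[k])
	}

	// 3. Hash the service definition.
	def, err := os.ReadFile(definition)
	if err != nil {
		return "", err
	}
	h.Write(def)

	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	hash, err := serviceHash("service.tar", "mesg.yml", map[string]string{"NETWORK": "mainnet"})
	if err != nil {
		panic(err)
	}
	fmt.Println(hash)
}
```

Note that nothing about the resolved dependencies enters this hash, which is exactly why #5 is not covered.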

- Download already built Docker image

:x: #2, #3
:white_check_mark: #1, #4, #5, #6
In the same way Docker Hub works, the developer builds the image and publishes it to an image repository.
From @Anthony: Docker can also generate a tarball that is easy to distribute with the command docker image save. But the size of the image could be way bigger than the source.
Users can then simply download the already-built image.
The big problem is that there is no way to guarantee the image is actually running the service's source code. A developer can publish an image that was not built from that source code.
One way to solve this would be to "control" the build process, either by forcing the build to go through the Core or by using a public CI. In both cases, there are many flaws and limitations.

- Deterministic image build + hash of env variables

:white_check_mark: #1, #2, #3, #4, #5, #6
This ultimate solution could be achieved by using a tool other than docker build to create the image in a deterministic way.
I found out that it's possible, but it may require "complicated" tools and could take weeks to actually implement.


  • I would especially like your feedback on the prioritization of #5. I feel it's important, but it makes the calculation of the hash dramatically more complicated.
  • Also, please correct the proposed solutions if I made mistakes.

#2

Another way to have a fully deterministic hash even with Docker is to save the built image with docker image save. This generates a tarball that we can shasum and distribute. It considerably increases the size of the service, but the image can then be imported into Docker with exactly the same layers (apt-get, npm install, etc… will be exactly the same).
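As a rough sketch of the checksum step in Go (assuming the tarball was produced beforehand with docker image save; the file name is illustrative):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// Tarball produced beforehand with: docker image save my-service -o image.tar
	f, err := os.Open("image.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	// The digest fingerprints the exact layers, so importing the tarball
	// elsewhere yields an image with byte-identical layers.
	fmt.Printf("%x\n", h.Sum(nil))
}
```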

Something I want clarification on here is what we actually want to be deterministic (and I think there is confusion here).

  • The hash of the service based on an “import”
  • The artifacts generated for a service for an “export”

I feel we are confusing these two, and that's why this PR https://github.com/mesg-foundation/core/pull/731 is getting messy.

For me, the priority now is to manage the import (because we control the export with the marketplace). We need to make sure that a given tarball will always generate the exact same hash. If the received tarball is different, then it's OK to get a different hash in that case. And for this case only, the tarsum of the tarball + env variables is sufficient.

For the "export" part, I agree it needs a lot more work, with the solutions you described or the save/import from Docker, but this is a different step.


#3

I will include it in the solution "Download already built Docker image".

So, if I understand correctly, you think goal #5 dependencies change (OS version, apt-get, npm i, go mod, etc…) is not a priority?

If we reduce the service's hash to just a checksum of the source, a lot of the unique properties of the hash will disappear. It will not accurately represent the service because too many variables will be omitted.


Updates:
Maybe it's enough for now to go back to a "simple" checksum of the source and not use the actual Docker image hash. The decentralized network is not implemented yet, so not having exactly the same running services across computers for a given hash is fine for now.
If we go in this direction, PR #731 actually meets the requirements and gives a bit more flexibility by also implementing 6. same hash even with different deployment methods.


#4

So how should we treat deploys from git or a local directory? (If the hash of the tarred directory is different, is that OK for you?) In that case, do we claim that we only support production deploys from the marketplace, and that any other way of deploying is for dev/staging/at your own risk?

note: docker image

If we are going to use the Docker image somehow, then in the future we can't easily replace Docker with a different container system (or an orchestrator like rkt, Kubernetes, cloud providers, etc…). And I think we want to do this in the near future, so binding the hash to Docker is only a temporary solution.

note: 5. dependencies change

Can we somehow provide a deterministic hash for this bullet? Using apt-get upgrade or npm install without a lock file can result in a different service. For this, the only option is to download something that is already built, because I don't see how any other method can provide a deterministic hash. Correct me if I'm wrong.

I just wanted to put those notes here, but I don’t have a solution for them in mind.

@Nicolas pointed out a good argument.

So maybe let's use either a checksum of the tarball or tarsum (as it provides #6), and we can come back to this topic later when we start building the network. By then we will have more requirements about what we need at implementation time.


#5

This is not highly critical, but it would be best to have the same hash all the time. As long as we import the service the same way we export it to the marketplace, that should be fine. E.g.: remove .git, select a unique folder, and finally create the tarball.

The tarball should then contain the exact same files and thus have the exact same hash. If there is any change in the git repo/directory, like permissions or a last-update header, then it is a different version and the tarball will be different.

I don't see any reason to try to keep the same tarball when the headers change, unless I'm missing something about headers, for example if Linux updates file metadata by itself (e.g. last-access timestamps).

Agreed :+1: I would use the checksum of the tarball, unless there is a good reason to use tarsum.
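To make the checksum-vs-tarsum trade-off concrete, here is a minimal Go sketch of a header-insensitive hash in the spirit of tarsum (my own illustration, not Docker's actual tarsum algorithm). It hashes only entry names and file contents, so metadata like mtime or permissions does not affect the result, whereas a plain checksum of the same tarball would change:

```go
package main

import (
	"archive/tar"
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

// contentHash walks a tar archive and hashes only entry names and file
// contents, ignoring header metadata such as mtime and permissions.
func contentHash(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			return "", err
		}
		io.WriteString(h, hdr.Name)
		if _, err := io.Copy(h, tr); err != nil {
			return "", err
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	sum, err := contentHash("service.tar")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(sum)
}
```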


#6

OK, given the current needs, let's go with a checksum of the tarball :+1:


#7

I agree with you, but let's be careful about one thing:
I think it's super important that the service's hash saved in the marketplace and the one calculated locally when deploying the service are the same, so we can actually enforce a verification of the two hashes.
We have to make sure that when publishing a service on the marketplace, the hash is calculated the same way as in the deploy command.
Because the Core is "cleaning" the service before calculating the hash, if the CLI simply calculates the hash of the tar published on the marketplace, it may get a different result (no deletion of .git, etc…).

My suggestion here is simply to deploy the service locally before publishing it, so the CLI can easily get the service's hash. Also, if we change the calculation of the hash, the CLI will not need any update because it relies on the Core to calculate it.
Do you agree with this? Do you see another way to do it?
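A tiny sketch of what that "cleaning" filter could look like (illustrative only; the actual filtering done by the Core may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// keep reports whether a path should be part of the hashed tarball.
// Dropping VCS metadata like .git before hashing means a raw checkout
// and a cleaned deploy produce the same digest.
func keep(name string) bool {
	return name != ".git" && !strings.HasPrefix(name, ".git/")
}

func main() {
	for _, n := range []string{"mesg.yml", ".git/HEAD", "src/main.go"} {
		fmt.Println(n, keep(n))
	}
}
```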


#8

Moreover, with the same hash in the marketplace, the Core could much more easily check whether the service exists on the marketplace and check for a purchase.
Currently, because the hashes are different, only the service deploy command can check whether the service has been purchased.


#9

:+1:

Deploying the service is a bit too much, but I agree this could be a good way to validate it. It can be done the same way as the validation proposed here (https://github.com/mesg-foundation/core/issues/191#issuecomment-465438396): a deploy that doesn't persist, but only simulates the deployment.


#10

I’ll stick with the idea I put on the PR itself.

For me:

Exports

It doesn't matter if dependencies (exports) change between different deploys, because locking versions correctly is entirely the developer's job. If they don't do it for some reason and the service becomes buggy, then it's a low-quality service.

Env Vars

We don't need to include env variables in the hash calculation. We actually only aim to calculate the hash from the source code itself. Overwriting env variables at deploy time does not change the service's source code, so its hash shouldn't change either.

Also, we don't publish the new service to the marketplace with the custom env var values applied at deploy time. So the service doesn't need a different hash depending on those custom env var values.

I'd like to move deploy-time env variables to service start time, and maybe generate a new hash by combining the service's actual hash + the env variables, to easily identify that service for stopping it later on, and so on.

With workflows, starting/depending on services will look like:

services:
  - graphql:
      sid: sid-or-just-tarball-hash
      env:
        - SCHEMA=`...type Users {...}...`

As you can see, it's easier to provide env variables directly in the workflow. The workflow VM can apply these env variables to the service at start time and save the unique service ID and/or the service hash (calculated by combining the service's hash + env variables) somewhere, in order to stop or delete the service later on.
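A minimal Go sketch of that combination (the runtimeHash name and the exact concatenation scheme are assumptions, not an actual implementation):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// runtimeHash derives a per-instance identifier from the service's
// deploy-time hash plus the env variables applied at start time: the same
// service started with different envs gets distinct, reproducible IDs.
func runtimeHash(serviceHash string, env map[string]string) string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order, independent of map iteration

	h := sha256.New()
	fmt.Fprint(h, serviceHash)
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, env[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	fmt.Println(runtimeHash("deadbeef", map[string]string{"SCHEMA": "type Users {}"}))
}
```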

Conclusion

By moving env vars to start time and leaving control of exports to developers, we can simply calculate hashes directly from the service's tarball without creating special hash-calculation techniques.

At marketplace publish time, devs can directly provide the tarball file, or its URL, for us to publish to the marketplace. Or they can provide the source files locally on their computers, and we can create a tarball, give a copy of it to the dev, and then publish it to the marketplace.

Extra

Supporting Git or different tarballs

I think there is no need to force keeping the same hash when deploying a service via git or from tarballs other than the one generated at the service's publish time. Devs should just get used to using the marketplace like npm, brew, or any other package manager. If they want to use a service from an unknown source, we can still provide the functionality of deploying services from tarball URLs and git, but checking the safety of that service should be totally up to the devs.

Using Docker for getting hash

I don't really like using a hash calculated by Docker, because it's hard to regenerate that hash from the source code later on. And it's hard to test the service's security and know what's going on inside if we directly publish the Docker image to the marketplace and use it, instead of rebuilding it from the source code at deploy time.


#11

Integrating the env into the hash calculation has the advantage of allowing the same service's source code to run with different envs, like running an Ethereum mainnet service and an Ethereum testnet service at the same time. This is a totally fine approach.

Yes, you are right, but we could imagine a hash verification only when the service is deployed without env, or, a bit more complicated, verifying the hash before applying the custom env.

This is the behavior of the deploy command. The deploy command is the one that actually creates the service, so it creates the hash.
The dev command is compatible with env and should be used when testing different envs.

We could replace the sid with the marketplace URL so the Core can download and deploy the service automatically, rather than the user doing it manually in advance.
Even if env variables are customizable, the Core could still check the service hash, like I said in paragraph #2.

I agree, but anyway this cannot be automatic because the Core doesn't know which hash to check against.


#12

It is still possible to do this by setting env vars at start time. A new hash can be calculated by combining the service's actual hash + the env vars, and the start command can output this new hash. By moving this to start time, we no longer complicate the hash calculation for the service itself, and we can still get unique hashes as we do now, just within the start command.

I don't see a problem with moving env vars to start time.


#13

So a "temporary" hash will be created during deploy? Otherwise, how do we start a service, only by SID? In that case, this is not a valid solution, as the hash should always be the "source of truth" and guarantee unicity.


#14

No no, we're still going to use the service's sid or hash to start it with the start command, like we always do.

  • The deploy command will calculate a simple hash over the tarball and output it.

  • The start command will receive env variables and the service's sid or hash to start a service. It'll calculate a new hash by combining the deploy-time hash + env vars, and then output it. The name start is a bit confusing, and I think this is where the confusion comes from. Think of it like the docker service create command: it also creates a new hash for the service.

the hash should always be the "source of truth" and guarantee unicity

This is another place where the confusion comes in. In my view, we don't break this rule even if we move env vars to the start command.

The hash we create with the deploy command is not temporary. It's the service's true hash. And it doesn't need to be temporary anyway; we need it.


#15

I really don’t like the idea of changing the hash after the service is deployed. I think it will bring a lot of confusion…