The capabilities of artificial intelligence in the field of image generation continue to surprise. This time the protagonist is Point-E, an open source project from OpenAI capable of generating 3D models of objects from a natural-language text description.
Three-dimensional models take just two minutes to generate with Point-E.
Following in the footsteps of tools such as DALL·E, which generates images from a text description, Point-E produces a 3D model of the described object that can be rotated in every direction, and it does so in a very short time: just one or two minutes on a single Nvidia V100 GPU, so no extraordinary hardware is required.
One of the main differences between the 3D generation approach used by Point-E and other tools is that it represents volumes using discrete sets of points, arranged like clouds, that give shape to the object. This is also the origin of its name: the E in Point-E stands for "efficiency", which, combined with the word "point", sums up how it works.
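To make the idea concrete, a point cloud is little more than a list of 3D coordinates, optionally with a color attached to each point. The toy sketch below (plain NumPy, not Point-E's own data structures) builds and inspects such a representation for a cone-like shape:

```python
import numpy as np

# A point cloud: N discrete points in space, each with an (x, y, z) position
# and, optionally, an (r, g, b) color. Here we build a toy cone-shaped cloud.
num_points = 4096
heights = np.random.uniform(0.0, 1.0, num_points)   # position along the cone axis
radii = (1.0 - heights) * 0.5                        # radius shrinks toward the tip
angles = np.random.uniform(0.0, 2.0 * np.pi, num_points)

coords = np.stack([radii * np.cos(angles),           # x
                   radii * np.sin(angles),           # y
                   heights], axis=1)                 # z

# Color every point orange; a real model would predict per-point colors.
colors = np.tile(np.array([1.0, 0.5, 0.0]), (num_points, 1))

point_cloud = {"coords": coords, "colors": colors}
print(point_cloud["coords"].shape)  # (4096, 3): just a discrete set of points
```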
From a computational point of view, point clouds are simpler to represent, although (and here lies Point-E's great limitation) the texture of the object's surface is not defined with precision: it looks like what it really is, a cluster of small spheres. This can have another undesirable consequence as well: sometimes a small portion of the 3D object ends up missing or distorted.
Behind Point-E lies a combination of two models: first, one that translates text into a two-dimensional image, and then another that converts that 2D image into a 3D model of the object. Both are trained on the same principle: the first receives images labeled with text so it can learn to travel that path in reverse, and the second receives three-dimensional objects paired with their two-dimensional renderings so it can learn that other reverse transition as well.
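Conceptually, the chaining of the two models fits in a few lines. The function names below (text_to_image and image_to_point_cloud) are hypothetical placeholders standing in for the two trained networks, not Point-E's actual API:

```python
import numpy as np

def text_to_image(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the first model, which turns text into a 2D image."""
    return np.zeros((64, 64, 3))  # placeholder image

def image_to_point_cloud(image: np.ndarray, num_points: int = 4096) -> np.ndarray:
    """Hypothetical stand-in for the second model, which turns a 2D image into a point cloud."""
    return np.zeros((num_points, 6))  # placeholder: x, y, z plus r, g, b per point

def text_to_3d(prompt: str) -> np.ndarray:
    # The two stages are simply chained: text -> 2D image -> point cloud.
    image = text_to_image(prompt)
    return image_to_point_cloud(image)

cloud = text_to_3d("a traffic cone in orange and white")
print(cloud.shape)  # (4096, 6)
```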
The process, therefore, goes as follows (a code sketch after the list shows how these steps can be run with the open source library):
- Reception of a text description: "a traffic cone in orange and white".
- Creation of a 2D image of the cone.
- Generation of a point cloud representing the cone.
- Obtaining of a 3D model of the traffic cone described at the start.
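For readers who want to try it, the open source point-e repository includes example notebooks that run this pipeline end to end. The sketch below follows the text-conditioned variant from those examples (it samples a point cloud directly from the prompt rather than saving the intermediate 2D image); exact module paths, model names and parameters should be checked against the current version of the repository:

```python
import torch
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Base model: produces a coarse 1,024-point cloud conditioned on the text prompt.
base_name = "base40M-textvec"
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler: refines the coarse cloud up to 4,096 points.
upsampler_model = model_from_config(MODEL_CONFIGS["upsample"], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint("upsample", device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS["upsample"])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=["R", "G", "B"],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=("texts", ""),  # only the base model sees the prompt
)

# Run the diffusion sampling loop for the prompt and keep the final batch.
samples = None
for x in sampler.sample_batch_progressive(
    batch_size=1, model_kwargs=dict(texts=["a traffic cone in orange and white"])
):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]  # point cloud with coordinates and RGB channels
```

The resulting point cloud can then be rendered directly or, using the additional model provided in the same repository, converted into a mesh to obtain the final 3D model.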