`/analyze` endpoint

The /analyze handler analyzes an image and returns the result, which can then be used for indexing.

When adding documents to Flow, the images must be downloaded and analyzed before the processed data is actually indexed. Image analysis is a computationally intensive task. To speed up the indexing of large volumes of documents and avoid a reduction in search performance during indexing, you can separate image analysis from indexing.

Scalable data workflow

Analyzing images is computationally expensive and therefore time-consuming compared to text processing. To avoid putting heavy load on your search server and to speed up the analysis step, you can split the analysis and indexing steps into separate processes and use multiple servers just for analyzing.

graph LR
  A[Client]  
  subgraph Analyze Cluster
  direction TB
  subgraph Server 1
  B{{Flow node /analyze}}  
  end
  subgraph Server 2
  B2{{Flow node /analyze}}  
  end  
  subgraph Server N
  B3{{Flow node /analyze}}  
  end  
  end
  D[(Database)]
  subgraph Search Server
  C{{Flow node /update}}
  end

  A <==>|1. Analyze Image| B
  A <--> B2
  A <--> B3
  A -.->|2. Store JSON for reuse| D
  A ==>|3. Index JSON| C

1. First, analyze images using the /analyze endpoint.
   - Use one or more servers, each running the Flow Docker container.
   - Since these Flow nodes only analyze images and do not store data, you can add and remove servers as needed.
   - The more servers you use, the faster the processing.
   - Optionally, speed up analysis with the parameter modules.apply.
2. Optionally, store the JSON in a database to reuse that data when re-indexing is necessary.
3. Then, index the JSON as the value of the pseudo-field import using the /update endpoint.
   - Indexing is now very fast because the heavy lifting has already been done.
   - Indexing no longer affects search performance.
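
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the host names, the helper names (analyze, index, run_workflow), and the client-side round-robin distribution are assumptions made for the example, and a plain dict stands in for the database. It uses only the standard library (urllib) to stay dependency-free; the /analyze and /update calls mirror the parameters described on this page.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical hosts: replace with your own analyze nodes and search server.
ANALYZE_NODES = [
    "http://analyze-1:8983/api/cores/my-collection",
    "http://analyze-2:8983/api/cores/my-collection",
]
SEARCH_NODE = "http://search:8983/api/cores/my-collection"


def analyze(node_url, img_url):
    """Step 1: let one analyze node download and analyze the image."""
    query = urlencode({"input.url": img_url})
    with urlopen(f"{node_url}/analyze?{query}") as rsp:
        return json.load(rsp)["outputs"]


def index(doc_id, img_url, pre_analyzed, search_url=SEARCH_NODE):
    """Step 3: index the pre-analyzed JSON via the pseudo-field import."""
    doc = {"id": doc_id, "image": img_url, "import": pre_analyzed}
    req = Request(
        f"{search_url}/update",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as rsp:
        return rsp.status


def run_workflow(images, nodes=ANALYZE_NODES, analyze_fn=analyze,
                 index_fn=index, store=None):
    """Analyze images in parallel, keep the JSON, then index it.

    images: dict mapping document id -> image URL
    store:  optional dict-like object standing in for the database (step 2)
    """
    store = {} if store is None else store
    doc_ids = list(images)
    # Round-robin assignment: image i is sent to node i % len(nodes).
    assigned = [nodes[i % len(nodes)] for i in range(len(doc_ids))]
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        outputs = pool.map(analyze_fn, assigned, (images[d] for d in doc_ids))
        for doc_id, out in zip(doc_ids, outputs):
            store[doc_id] = out  # step 2: keep the JSON for later re-indexing
    for doc_id in doc_ids:
        index_fn(doc_id, images[doc_id], store[doc_id])
    return store
```

Calling run_workflow({"1": "https://example.com/a.jpg"}) would analyze the image on the first node and then index it on the search server. A load balancer in front of the analyze nodes is an equally valid way to distribute requests; the round-robin list is used here only to keep the sketch self-contained.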

Which workflow to choose?

Given the information above, we suggest using the scalable workflow (separate analysis and indexing) if:

- you have more than 100,000 images to index,
- you need to re-index often (your data or schema changes), or
- you need to reduce load on the search server (constant search times are important).

Import pre-analyzed image

To speed up indexing you can also import pre-analyzed image data using the outputs of the /analyze endpoint.

1. Send the image to the /analyze endpoint. Optionally, speed up analysis with the parameter modules.apply.
2. Extract the analyzed image data as a JSON string from the outputs field of the response.
3. Add the JSON string as the field value of the special field import when indexing the corresponding doc.

Separate image analysis from indexing.

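import requests

IMG_URL = "YOUR_IMAGE_URL"
FLOW_URL = "http://localhost:8983/api/cores/my-collection"
# 1. Analyze image (params= URL-encodes the image URL)
rsp = requests.get(FLOW_URL + "/analyze", params={"input.url": IMG_URL})
rsp.raise_for_status()
# The outputs field holds the analyzed image data as a JSON string
pre_analyzed = rsp.json()["outputs"]
# 2. Index image: pass the JSON string in the special import field
doc = {
    "id": "1",
    "image": IMG_URL,
    "import": pre_analyzed,
}
requests.post(FLOW_URL + "/update", json=doc).raise_for_status()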

Add doc with pre-analyzed image data.

curl --request POST \
  --url 'http://localhost:8983/api/cores/my-collection/update' \
  --header 'Content-Type: application/json' \
  --data '  {
    "id" : "1",
    "image" : "YOUR_IMAGE_URL",
    "import" : "{\"color_cluster\":117,\"color_rerank\":5707186122717457336,\"color_isolated\":false,\"content_lopq\":\"HnjaneuchsAm\",\"content_rerank\":-4163907962861818406,\"color_palette_freq\":[0.24072231,0.22010717,0.21671125,0.18918492,0.14692917],\"color_palette_hex\":[\"#FDFDFD\",\"#FF8E3F\",\"#FF635B\",\"#FF7A4D\",\"#FF3A70\"],\"content_descriptor\":\"E61SS785+fDoUFqsGB3tBv/TTYs1lIhHAeMCPQbm6wIGpwx8k7wa8wMW3VYM38TgiKuBMT/slvPSzCX+l8/4I77tf1n32z1bDAAydPRy+E4mE1SBGdvZFGamU0EG7ALu7IHYLqY1Gh8/Bg4mwywc7IdE2DWkBwb6/ObHCyLQB8o=\",\"copyspace\":[8],\"duplicate_nclusters\":\"X1bpk08JXrXsdpoeQKw6nA==\",\"color_lopq\":\"daenygQBnoR1\",\"duplicate_lopq\":\"XzSGyemXKxkt\",\"color_names\":[\"orange\",\"red\",\"white\",\"brown\"],\"duplicate_rerank\":-7442502864752548956,\"content_nclusters\":\"HghAbA+D2XbdedHK6D1aGQ==\",\"color_descriptor\":\"fx0Bf+QD0g4QQBb8AroCJ7zwHVwOC0sH\",\"color_nclusters\":\"dQWESMEwfzHq3cZTk7FDHg==\",\"duplicate_descriptor\":\"XJc7jT9JMsxSS8T2AIGrxGxOs4GXFQztB5oK28C3V4ie7tUozMmdx9RXxqM8+Totzghkw4G16TLA9xvvIP/ZKfSB2Gd/w9UevsnoFhF/IWyB9/6B7Dv4TAt3Dq4jv+aq1/ZILw8ksA0fTb1PEDUA/gievD3qZDfnxPv0aN86JyiBgSIiFIHugQ4DURvTRIE1yQvDJYHLJfiuJYFu5Ox/nwjhNgkJLN/5CLxI3Db1egQu37fSNoUN66Admiae824g01swZPjkMfoL/dF/scGnWicqIiCTIGsEaQm6I9TsEcvlxNS/S0zBNuPSQaffTTo1vNksHDjBBLnm/ywjjxTiAn871OAd5hNJYvr1f5vof5jvYn7UOWbmDvG15BvZ2QGUjsrBKe0U1LwnsDVXFvLyoPHRgWH88Rh8l4wqADCfQrmR8y4Ba1V/wVFV/B8WYXfX3X/5DwTxq8S8ajK8kynnHir5+w8Zn2bhgR/4m3HMRsUfP+suVcrhHfL0quWB+Mb0VyLQ3Yx/Hs5g9i3pIcT7SGrs58LwgRPwDQj8A5vIfOrio4EjXeYPEyDY1y7T0bk2HLvn7ewdtT+CpMv0m6ojtfrlHpiBILT9gQT5xJJM/Q/UP/vv6ekTEH+LRBkd8Bdm5Zsf8T416BRiHLIlhkgJBkLjAwf5JNVL/evGKztLTvs=\",\"duplicate_cluster\":95,\"content_cluster\":30}"
  }'

If the special field import is present in the doc, Flow uses the import field as data source and not the image field. We recommend indexing the original image URL as well, to be able to display the image when using the HTML response writer or your own UI.

The special import field value is not indexed.
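
Because Flow reads from the import field rather than the image field, stored analysis output can be replayed whenever you need to re-index. The following sketch shows one way to keep that output around. It assumes SQLite as the database and a hypothetical table layout and helper names (open_cache, save_outputs, docs_for_reindex); any store that maps a document id to the outputs JSON string would work equally well.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small SQLite cache for /analyze outputs."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS analysis ("
        "id TEXT PRIMARY KEY, image TEXT NOT NULL, outputs TEXT NOT NULL)"
    )
    return con


def save_outputs(con, doc_id, img_url, outputs):
    """Store (or overwrite) the outputs JSON string for one document."""
    con.execute(
        "INSERT OR REPLACE INTO analysis (id, image, outputs) VALUES (?, ?, ?)",
        (doc_id, img_url, outputs),
    )
    con.commit()


def docs_for_reindex(con):
    """Yield documents ready to be posted to /update, without re-analysis."""
    rows = con.execute("SELECT id, image, outputs FROM analysis ORDER BY id")
    for doc_id, img_url, outputs in rows:
        # The original image URL is kept so the image can still be displayed.
        yield {"id": doc_id, "image": img_url, "import": outputs}
```

Each yielded dict has the same shape as the doc in the examples above, so it can be sent to the /update endpoint unchanged.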

Upload image

The following snippet loads the image and scales it down to thumbnail size in memory before uploading it to Flow. This has two advantages:

- The Python Pillow package supports many more image formats than the Java runtime used by Flow. By converting the image to PNG before uploading, you can easily handle a wide range of formats.
- Downscaling the image before uploading significantly speeds up processing for high-resolution images by minimizing I/O and decoding times.

Optionally, speed up analysis with the parameter modules.apply.

Upload thumbnail version to speed up image analysis.

import requests
import base64
from PIL import Image
from io import BytesIO

IMG_PATH = "YOUR_IMAGE_FILE"
FLOW_URL = "http://localhost:8983/api/cores/my-collection"
# Load & scale to thumbnail size - HighRes images slow down IO tremendously
img = Image.open(IMG_PATH)
img.thumbnail((400, 400))
# encode as in-memory png
buffer = BytesIO()  # avoid shadowing the built-in name bytes
img.save(buffer, format='PNG')
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
payload = {'input.data': base64_image}
# set empty logParamsList= to avoid logging huge log messages when uploading base64 images as parameters
rsp = requests.post(FLOW_URL+"/analyze?logParamsList=", data=payload)
pre_analyzed = rsp.json()["outputs"]
print(pre_analyzed)

The following command converts an image file (PNG, JPG, and GIF are supported) to base64, pipes the result to curl as the value of the input.data parameter, and sends it to the /analyze endpoint.

base64 --wrap=0 YOUR_FILE_PATH | \
curl --data-urlencode "input.data@-" \
--url 'http://localhost:8983/api/cores/my-collection/analyze?logParamsList='
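
For files that are already in one of the supported formats, the same upload can be done from Python without Pillow. The sketch below is a rough counterpart to the cURL command, using only the standard library. The helper names (build_upload_body, analyze_file) are hypothetical; the input.data and logParamsList parameters and the outputs field are the ones described on this page.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

FLOW_URL = "http://localhost:8983/api/cores/my-collection"


def build_upload_body(image_bytes):
    """Form-encode raw image bytes as the input.data parameter."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    # urlencode percent-escapes '+', '/' and '=' from the base64 alphabet
    return urlencode({"input.data": b64}).encode("ascii")


def analyze_file(path, flow_url=FLOW_URL):
    """Upload an image file as-is and return the outputs JSON string."""
    with open(path, "rb") as f:
        body = build_upload_body(f.read())
    # An empty logParamsList= avoids logging the large base64 parameter
    req = Request(f"{flow_url}/analyze?logParamsList=", data=body)
    with urlopen(req) as rsp:
        return json.load(rsp)["outputs"]
```

Unlike the Pillow variant, this sends the file at full resolution, so it trades the speed-up from downscaling for one less dependency.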

Java API

To directly analyze images within Java, you can use the SimpleOutputProducer class which is included in the Flow jar file.

import java.net.URL;
import de.pixolution.api.SimpleOutputProducer;
import de.pixolution.api.json.JsonWriter;
import de.pixolution.storage.ServiceOutput;

public class JavaAnalyzeAPI {

    public static void main(String[] args) {
        // Note: Reuse the producer for best performance
        try (SimpleOutputProducer producer = new SimpleOutputProducer()) {
            URL url = new URL("https://website.com/my-image.jpg");
            // Download & analyze image
            ServiceOutput result = producer.calcOutput(url);
            JsonWriter writer = new JsonWriter();
            // jsonResult can be directly set in import field when indexing
            String jsonResult = writer.write(result);
            System.out.println(jsonResult);
        } catch (Exception e) {
            // Handle exception
        }
    }
}
